Best mysql questions in September 2011

Escaping user input from database necessary?

12 votes

So I know about SQL injection and always escape all my user input before putting it in my database. However, I was wondering: imagine a user tries to submit a query to inject, and I escape it. What if I later take this value from the database and use it in a query? Do I have to escape it again?

So: (sql::escape() contains my escape function)

$userinput = "'); DROP `table` --";
mysql_query("INSERT INTO `table` 
             (`foo`,`bar`) 
             VALUES 
             ('foobar','".sql::escape($userinput)."')");

// insert php/mysql to fetch `table`.`bar` into $output here

mysql_query("INSERT INTO `table2` 
            (`foo`,`bar`) 
            VALUES
            ('foobar','".$output."')");

Does MySQL automatically escape its output or something like that, or should I escape in the second query as well?

This is a test case, but the same situation occurs in other ways within my program, and I'm wondering how tight the security has to be for cases like this.

EDIT

My escape function

static function escape($string){

    if(get_magic_quotes_gpc()) 
        $string = stripslashes($string); 

    return mysql_real_escape_string($string);

}

Does MySQL automatically escape its output or something like that, or should I escape in the second query as well?

You need to escape in the second query as well. MySQL does not do any escaping on its output.

Long answer: MySQL string escaping does not modify the string that is being inserted; it just makes sure the string does no harm in the current query. Any SQL injection attempt still remains in the data.
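
To make the point concrete, here is a minimal sketch in Python, with the standard-library sqlite3 module standing in for MySQL (an assumption made only so the example is runnable; the table names table1 and table2 are hypothetical). It uses bound parameters rather than the escape function from the question, which is a different technique, and shows that the stored value comes back exactly as the user typed it, so the second query needs the same protection as the first.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (foo TEXT, bar TEXT)")
conn.execute("CREATE TABLE table2 (foo TEXT, bar TEXT)")

user_input = "'); DROP TABLE table1 --"

# First insert: the value is bound as a parameter, never spliced into the SQL text.
conn.execute("INSERT INTO table1 (foo, bar) VALUES (?, ?)", ("foobar", user_input))

# Reading it back returns the original string, quote characters included.
output = conn.execute("SELECT bar FROM table1").fetchone()[0]

# Second insert: the fetched value is still untrusted data, so bind it again.
conn.execute("INSERT INTO table2 (foo, bar) VALUES (?, ?)", ("foobar", output))

copied = conn.execute("SELECT bar FROM table2").fetchone()[0]
```

Concatenating output directly into the second statement would reintroduce the injection, because the database stored the raw string and not an escaped form of it.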

List and Linq To Sql Performance Issue

9 votes

I have one table (MySQL) with 2 million records and one List with 100 records. I use a List Except lambda expression to find all the URLs that are in the List but not in the table.

The issue is that it takes a lot of time, around 5 minutes. I am working on a powerful VPS, with the code and the database on the same server.

Please suggest all possible ways to increase the performance of LINQ to SQL and LINQ to Entities.

My code is:

return Urls.Except(DbContext.postedurllists.Select(crawl => crawl.PostedUrl).ToList()).ToList();

where Urls is a List which contains 100 URLs and postedurllists is a table that contains 2 million records. Thanks

You're currently pulling all of the URLs from the database. That's not a good idea. Instead, I would suggest pulling the intersection from the database by effectively passing your Urls list into the database, and doing an except based on the results:

var commonUrls = DbContext.postedurllists
                          .Select(c => c.PostedUrl)
                          .Where(url => Urls.Contains(url))
                          .ToList();

var separateUrls = Urls.Except(commonUrls);
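
The same idea can be sketched outside LINQ. The following Python example uses the standard-library sqlite3 module as a stand-in for MySQL (an assumption for runnability; the sample URLs are made up). It asks the database only for the URLs that appear in the short list, then computes the difference locally, so the 2 million rows never leave the server.

```python
import sqlite3

# Assumption: SQLite in memory stands in for the MySQL table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE postedurllists (PostedUrl TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO postedurllists (PostedUrl) VALUES (?)",
    [("http://example.com/%d" % i,) for i in range(1000)],
)

# The short in-memory list: two URLs already posted, one new.
urls = ["http://example.com/1", "http://example.com/2", "http://example.org/new"]

# Build one "?" placeholder per URL and let the database do the IN lookup.
placeholders = ",".join("?" for _ in urls)
common = {
    row[0]
    for row in conn.execute(
        "SELECT PostedUrl FROM postedurllists WHERE PostedUrl IN (%s)" % placeholders,
        urls,
    )
}

# The difference is computed locally over at most len(urls) values.
separate_urls = [u for u in urls if u not in common]
```

With an index on PostedUrl, the IN lookup costs roughly one index probe per list entry instead of a scan of the whole table.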

Pizza & Food - database design

8 votes

I want to create a website that allows customers to order food from the website.

There are two food types:

  • regular food/drink (eg: burger, doner kebab, chips, coke, pepsi etc)
  • Pizza food (eg: Margherita Pizza, meat Pizza, etc)

If they select a pizza from the list, they may need to select a base (thin crust, deep crust), extras, and the pizza size/option.

How should I design the tables in this situation?

Note: Each item has 1 or more options. An option may have 1 or more extras, or none. If the item is a pizza, it may have a base (crust).

Screenshots prototype

See the two screenshots of what I am trying to implement. Am I on the right path with the database design, or what could I have done better?

Pizza customize:

[screenshot]

Beef Burger customize:

[screenshot]

Extras functionality (Dropdown / tickboxes)

For the extras, I sometimes need to show multiple extras as dropdowns instead of tickboxes. That means the customer can only choose one value from each of 1, 2 or 3 dropdowns. How do I implement this in the database? Example below:

[screenshot]

Here is what I have come up with:

categories Table:

+----------+--------------+------+-----+---------+----------------+
| Field    | Type         | Null | Key | Default | Extra          |
+----------+--------------+------+-----+---------+----------------+
| cat_id   | int(11)      | NO   | PRI | NULL    | auto_increment |
| cat_name | varchar(100) | NO   |     | NULL    |                |
+----------+--------------+------+-----+---------+----------------+

items Table:

+-----------+--------------+------+-----+---------+----------------+
| Field     | Type         | Null | Key | Default | Extra          |
+-----------+--------------+------+-----+---------+----------------+
| item_id   | int(11)      | NO   | PRI | NULL    | auto_increment |
| cat_id    | int(11)      | NO   |     | NULL    |                |
| item_name | varchar(100) | NO   |     | NULL    |                |
| item_type | int(11)      | NO   |     | NULL    |                |
+-----------+--------------+------+-----+---------+----------------+

- item_type (0 = normal, 1 = pizza, 2 = set meal)

item_options Table:

+-------------+--------------+------+-----+---------+----------------+
| Field       | Type         | Null | Key | Default | Extra          |
+-------------+--------------+------+-----+---------+----------------+
| option_id   | int(11)      | NO   | PRI | NULL    | auto_increment |
| item_id     | int(11)      | NO   |     | NULL    |                |
| option_name | varchar(100) | NO   |     | NULL    |                |
| price       | decimal(6,2) | NO   |     | NULL    |                |
+-------------+--------------+------+-----+---------+----------------+

item_extras Table: (Do you think I should have separate tables for pizza toppings and extras?)

+-----------+--------------+------+-----+---------+----------------+
| Field     | Type         | Null | Key | Default | Extra          |
+-----------+--------------+------+-----+---------+----------------+
| extra_id  | int(11)      | NO   | PRI | NULL    | auto_increment |
| option_id | int(11)      | NO   |     | NULL    |                |
| name      | varchar(50)  | NO   |     | NULL    |                |
| cost      | decimal(6,2) | NO   |     | NULL    |                |
+-----------+--------------+------+-----+---------+----------------+

item_pizza_base Table:

+-----------+--------------+------+-----+---------+----------------+
| Field     | Type         | Null | Key | Default | Extra          |
+-----------+--------------+------+-----+---------+----------------+
| base_id   | int(11)      | NO   | PRI | NULL    | auto_increment |
| option_id | int(11)      | NO   |     | NULL    |                |
| base_name | varchar(50)  | NO   |     | NULL    |                |
| cost      | decimal(6,2) | NO   |     | NULL    |                |
+-----------+--------------+------+-----+---------+----------------+

SQL Result:

mysql> select * from categories;
+--------+----------+
| cat_id | cat_name |
+--------+----------+
|      1 | Pizzas   |
|      2 | Burgers  |
+--------+----------+

mysql> select * from items;
+---------+--------+------------------+-----------+
| item_id | cat_id | item_name        | item_type |
+---------+--------+------------------+-----------+
|       1 |      1 | Vegetarian Pizza |         1 |
|       2 |      2 | Beef Burger      |         0 |
+---------+--------+------------------+-----------+

mysql> select * from item_options;
+-----------+---------+-------------+-------+
| option_id | item_id | option_name | price |
+-----------+---------+-------------+-------+
|         1 |       1 | 12 Inches   |  5.60 |
|         2 |       1 | 14 Inches   |  7.20 |
|         3 |       2 | 1/4lb       |  1.80 |
|         4 |       2 | 1/2lb       |  2.50 |
+-----------+---------+-------------+-------+

mysql> select * from item_extras;
+----------+-----------+-----------+------+
| extra_id | option_id | name      | cost |
+----------+-----------+-----------+------+
|        1 |         1 | Mushroom  | 1.00 |
|        2 |         1 | Pepperoni | 1.00 |
|        3 |         2 | Mushroom  | 1.00 |
|        4 |         2 | Pepperoni | 1.00 |
|        5 |         3 | Chips     | 0.50 |
|        6 |         4 | Chips     | 0.50 |
+----------+-----------+-----------+------+

As you can see, the extras for burgers and pizzas are in one table. Should they be separated?

mysql> select * from item_pizza_base;
+---------+-----------+------------+------+
| base_id | option_id | base_name  | cost |
+---------+-----------+------------+------+
|       1 |         1 | Thin Crust | 0.00 |
|       2 |         1 | Deep Crust | 0.00 |
|       3 |         2 | Thin Crust | 0.00 |
|       4 |         2 | Deep Crust | 0.00 |
+---------+-----------+------------+------+

Keep in mind that the price of extras is not always the same for each item. For example, each extra costs 1.00 on a 10" pizza but 0.50 on a 12" pizza. There will also be cases where different pizzas have different extras costs.

Is the database design correct, or what could be improved?

I'm stuck on the extras functionality: how do I design the tables and fields for the dropdown extras? See the section "Extras functionality (Dropdown / tickboxes)" above.

If you need to make a dropdown, put the values for the dropdown in a table.
It's OK to put extra values in there, as long as you can easily separate out the values you need in the dropdown.

Example

table option_labels
-------------------
id    integer auto_increment PK
name  varchar(40)

table toppings
--------------
id               integer  auto_increment PK
option_label_id  integer  foreign key references option_labels(id)
item_id          integer  foreign key references items(item_id)
price            decimal(10,2)

As soon as you know the item, you can populate the dropdown using:

SELECT ol.name, t.price FROM toppings t
INNER JOIN option_labels ol ON (t.option_label_id = ol.id)
WHERE t.item_id = '$item_id'

Normalize those tables
This table has a major flaw:

mysql> select * from item_extras;
+----------+-----------+-----------+------+
| extra_id | option_id | name      | cost |
+----------+-----------+-----------+------+
|        1 |         1 | Mushroom  | 1.00 |
|        2 |         1 | Pepperoni | 1.00 |
|        3 |         2 | Mushroom  | 1.00 |
|        4 |         2 | Pepperoni | 1.00 |
|        5 |         3 | Chips     | 0.50 |
|        6 |         4 | Chips     | 0.50 |
+----------+-----------+-----------+------+

It's not normalized; put the names in a separate labels table, just like in the example above. If a row (excluding the id) is not unique, then your data is not normalized and you're doing it wrong.

Because you are using a lot of joins, it's IMHO best to use InnoDB; its tables are clustered on the primary key, which speeds up joins that use PKs a lot.

No matter what people say
Don't denormalize until slowness starts.

Use indexes
Put an index on all fields named *_id.
Also put an index on all fields that are often used in WHERE clauses.
Do not put indexes on fields with low cardinality, so no index on a boolean or enum field!
Indexes slow down inserts and speed up selects.
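
As a rough illustration of the normalization and indexing advice, here is a sketch in Python using the standard-library sqlite3 module in place of MySQL (an assumption for runnability). The column option_label_id, the index name and the helper extras_for_option are hypothetical names introduced for this example; NUMERIC stands in for MySQL's DECIMAL.

```python
import sqlite3

# Assumption: SQLite in memory stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE option_labels (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE          -- each extra name is stored once
);
CREATE TABLE item_extras (
    extra_id        INTEGER PRIMARY KEY,
    option_id       INTEGER NOT NULL,
    option_label_id INTEGER NOT NULL REFERENCES option_labels(id),
    cost            NUMERIC NOT NULL,  -- price can differ per option (size)
    UNIQUE (option_id, option_label_id)
);
-- The UNIQUE constraint already indexes option_id; index the other *_id column too.
CREATE INDEX idx_item_extras_label ON item_extras (option_label_id);
""")

conn.executemany(
    "INSERT INTO option_labels (id, name) VALUES (?, ?)",
    [(1, "Mushroom"), (2, "Pepperoni"), (3, "Chips")],
)
conn.executemany(
    "INSERT INTO item_extras (option_id, option_label_id, cost) VALUES (?, ?, ?)",
    [(1, 1, 1.00), (1, 2, 1.00), (2, 1, 1.00), (2, 2, 1.00), (3, 3, 0.50), (4, 3, 0.50)],
)

def extras_for_option(option_id):
    """Return (name, cost) pairs to populate the dropdown for one option."""
    return conn.execute(
        "SELECT ol.name, e.cost FROM item_extras e "
        "JOIN option_labels ol ON e.option_label_id = ol.id "
        "WHERE e.option_id = ? ORDER BY ol.name",
        (option_id,),
    ).fetchall()
```

Renaming an extra now means updating one row in option_labels, while the per-option price stays in item_extras.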

MySQL aggregate function problem

8 votes

In the following example, why does the min() query return results, but the max() query does not?

mysql> create table t(id int, a int);
Query OK, 0 rows affected (0.10 sec)

mysql> insert into t(id, a) values(1, 1);
Query OK, 1 row affected (0.03 sec)

mysql> insert into t(id, a) values(1, 2);
Query OK, 1 row affected (0.02 sec)

mysql> select * from t
    -> ;
+------+------+
| id   | a    |
+------+------+
|    1 |    1 |
|    1 |    2 |
+------+------+
2 rows in set (0.00 sec)

mysql> select * from t where a < 4;
+------+------+
| id   | a    |
+------+------+
|    1 |    1 |
|    1 |    2 |
+------+------+
2 rows in set (0.00 sec)

mysql> select * from t where a < 4 having a = max(a);
Empty set (0.00 sec)

mysql> select * from t where a < 4 having a = min(a);
+------+------+
| id   | a    |
+------+------+
|    1 |    1 |
+------+------+
1 row in set (0.00 sec)

The HAVING clause is used to filter groups of rows. You reference min(a) and max(a) which (in the absence of any GROUP BY clause) aggregate over all a values in the table but then use a comparison against a single a value.

So which a value is MySQL supposed to use? All other RDBMSs that I know of would throw an error at this point; however, MySQL does allow this. From the docs:

Standard SQL does not permit the HAVING clause to name any column not found in the GROUP BY clause unless it is enclosed in an aggregate function. MySQL permits the use of such columns to simplify calculations. This extension assumes that the nongrouped columns will have the same group-wise values. Otherwise, the result is indeterminate.

So in your case from the results you are getting it appears that it ended up using 1 as the scalar value for a but this behaviour is not guaranteed and it could equally well have used 2 or any other existing a value.
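
One way to get a deterministic result is to compute the aggregate in a scalar subquery and compare against it in the WHERE clause. The sketch below rebuilds the table from the question in Python with the standard-library sqlite3 module (SQLite stands in for MySQL here, which is an assumption; the subquery form itself is standard SQL).

```python
import sqlite3

# Assumption: SQLite in memory stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INT, a INT)")
conn.executemany("INSERT INTO t (id, a) VALUES (?, ?)", [(1, 1), (1, 2)])

# The aggregate is evaluated once in the subquery, then each row is compared to it.
max_rows = conn.execute(
    "SELECT id, a FROM t WHERE a < 4 AND a = (SELECT MAX(a) FROM t WHERE a < 4)"
).fetchall()
min_rows = conn.execute(
    "SELECT id, a FROM t WHERE a < 4 AND a = (SELECT MIN(a) FROM t WHERE a < 4)"
).fetchall()
```

Here max_rows holds the row with a = 2 and min_rows the row with a = 1, regardless of which row the engine happens to visit first.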

tool to generate test sql data by just providing table structure

8 votes

Is there any tool that can generate INSERT queries with test data from just a CREATE TABLE query (the table structure)? It should work with MySQL.

Useful resources found using Google, of which this online data generator would be my personal favorite: http://www.generatedata.com/#generator

Features:

  • (JS-enabled) browser-friendly.
  • Many data types available: names, phone numbers, email addresses, cities, states, provinces, counties, dates, street addresses, number ranges, alphanumeric strings, lorem ipsum text and more.
  • Option to generate data in XML, Excel, HTML, CSV or SQL.
  • Country specific data (state / province / county) for Canada, US, Netherlands and UK.
  • Does your laundry.
  • Saves your data generation forms for later use (downloadable version & for donators only)

Others:

http://www.webresourcesdepot.com/test-sample-data-generators/

http://www.sqlmanager.net/en/products/mysql/datagenerator?gclid=CKye8-ODoasCFYpM4AodCDiCgA

What ways are there to store information about an anonymous/guest user in a database?

7 votes

Our application has an online shop among other features, and users are normally requested to register before completing a sale, creating a unique customer_ID in the process. When they return, they can log in and their contact details and transaction history are retrieved from the database.

We are now exploring what to do in the case of an 'anonymous' or 'guest' customer, opening up the online shop to customers who don't want to register, and also for sales logged in the backend application, where taking the customer's email, postal address, etc is just too time consuming. The solution has applications outside the online shop too.

Multiple companies use the same database, and the database is built on a party model structure, so we have explored a few options:

  1. Store all anonymous customers under one pre-defined customer_ID in the transaction table:
    1. customer_ID = 0 for every anonymous user, and customer_ID > 0 for every real user
      • This is straight-forward to hard-code into the application
      • But more involved to determine which customers belong to which company
      • Should details for customer_ID = 0 exist in the customer table in the database or as an object in the application?
        • If in the database, what database-level constraints can be made to ensure that it always exists?
        • If not in the database, then foreign key constraints from transaction.customer_ID to customer.customer_ID no longer work
    2. customer_ID is the same as the company party_ID
      • Easier to determine aggregate sales for each company, etc
      • This would confuse matters as it would appear that the company is its own customer, rather than other unique customers
  2. Generate a unique customer_ID for every new anonymous customer (per session)
    • What if the same physical user returns? There will be many records repeating the same sort of data; email, shipping address, etc.
  3. Use another unique key, such as email address, to refer to a customer
    • Not always reliable as people sometimes use more than one email address, or leave old addresses behind.
    • What if there is no email address to be taken, as is the case on the shop floor, pro forma invoices, etc?
  4. Some other Stack Overflow inspired solution!

Addition

A combination of #2 and #3 has been suggested elsewhere - attempt to store a single record for each customer, using the email address if possible, or a new record on every visit if not.

I should point out that we don't need to store a record for every anonymous customer, but it just seems that the relational database was built to deal with relationships, so having a NULL or a customer_ID in the transaction table that doesn't reference an actual customer record just seems wrong...

I must also stress that the purpose of this question is to determine what real-world solutions there are to recording 'casual' transactions where no postal address or email address is given (imagine a supermarket checkout) alongside online shop transactions where an email address and postal address are given whether they are stored or not.

What solutions have the SO community used in the past?

Assuming you require an e-mail address for all online orders, you could create a temporary account for every customer at the completion of each order when they are not logged in.

This can be done by using the shipping address and other information provided during checkout to fill in the account, and e-mailing a random temporary password to them (optionally flagging it to require changing on the first log-in, if that functionality is built into the website). This requires minimal effort on their part to set up the account, and allows them to sign in to check their order status.

Since the primary key in your database is the customer_id, it should not cause conflicts if they continue making new accounts with the same e-mail/address/etc, unless you have code in place to prevent duplicates already. It's rare for someone to create more than one temporary account though, since it's easier to log in with the password e-mailed to them than entering their data again.

For the backend orders, we generally create an account in the same way as above for every customer. However, if they don't have an e-mail address (or they only want to purchase by phone), we generate an account with their shipping information and a blank e-mail address (have to code an exception to not send temporary passwords/order confirmations when it's blank). The customer_id is given to them, and their shipping information and company name are stored in the account to look up and expedite future orders.
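
A minimal sketch of this approach, in Python with the standard-library sqlite3 module standing in for the real database (an assumption), might look like the following. The table layout, the is_temporary flag and the helper record_guest_order are hypothetical names for illustration; sending the temporary password is left out.

```python
import sqlite3

# Assumption: SQLite in memory stands in for the production database.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE customer (
    customer_id      INTEGER PRIMARY KEY,
    email            TEXT,               -- NULL for phone / shop-floor orders
    shipping_address TEXT NOT NULL,
    is_temporary     INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE transactions (
    transaction_id INTEGER PRIMARY KEY,
    customer_id    INTEGER NOT NULL REFERENCES customer(customer_id),
    total          NUMERIC NOT NULL
);
""")

def record_guest_order(conn, shipping_address, total, email=None):
    """Create a temporary account and attach the order to it in one transaction."""
    with conn:
        cur = conn.execute(
            "INSERT INTO customer (email, shipping_address) VALUES (?, ?)",
            (email, shipping_address),
        )
        customer_id = cur.lastrowid
        conn.execute(
            "INSERT INTO transactions (customer_id, total) VALUES (?, ?)",
            (customer_id, total),
        )
    return customer_id

online_id = record_guest_order(conn, "1 High Street", 12.5, email="guest@example.com")
phone_id = record_guest_order(conn, "2 Low Road", 7.0)  # no e-mail address given
```

Every transaction still references a real customer row, so the foreign key holds even for guests without an e-mail address.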

processing a large number of database entries with paging slows down with time

7 votes

I am trying to process millions of records from my table (size is about 30 GB) and I am currently doing it using paging (mysql 5.1.36). The query I use in my for loop is

select blobCol from large_table 
where name= 'someKey' and city= 'otherKey' 
order by name
LIMIT <pageNumber*pageSize>, <pageSize>

This works perfectly fine for about 500K records. I have a page size of 5000 that I am using and after page 100, the queries start slowing down dramatically. The first ~80 pages are extracted in 2-3 seconds but after around page 130, each page takes about 30 seconds to retrieve, at least until page 200. One of my queries has about 900 pages and that would take too long.

The table structure is (type is MyISAM)
    name char(11)
    id int // col1 & col2 is a composite key
    city varchar(80) // indexed
    blobCol longblob

What can I do to speed it up? The EXPLAIN for the query shows this:

select_type: SIMPLE
possible_keys: city
key : city
type: ref
key_len: 242
ref: const
rows: 4293720
Extra: using where; using filesort

In case it helps, the my.cnf for my server (24 GB ram, 2 quad core procs) has these entries

  key_buffer_size = 6144M
  max_connections = 20
  max_allowed_packet = 32M
  table_open_cache = 1024
  sort_buffer_size = 256M
  read_buffer_size = 128M
  read_rnd_buffer_size = 512M
  myisam_sort_buffer_size = 128M
  thread_cache_size = 16
  tmp_table_size = 128M
  max_heap_table_size = 64M

Here is what I did, and reduced the total execution time by a factor of 10.

What I realized from the execution plan of my original query was that it was using filesort for sorting all results and ignoring the indexes. That is a bit of a waste.

My test database: 5 M records, 20 GB in size. The table structure is the same as in the question.

Instead of getting blobCol directly in the first query, I first get the value of 'name' for the beginning of every page. Run this query repeatedly until it returns 0 results. Each time, add the result to a list.

SELECT name
FROM my_table
where id = <anyId> -- I use the id column for partitioning so I need this here
order by name
limit <pageSize * pageNumber>, 1

Since the page number is not known in advance, start with value 0 and keep incrementing until the query returns null. You can also do a select count(*), but that itself might take long and will not help optimize anything. Each query took about 2 seconds to run once the page number exceeded ~60.

For me, the page size was 5000, so I got a list of 'name' strings at offsets 0, 5000, 10000, 15000 and so on. The number of pages turned out to be 1000, and storing a list of 1000 results in memory is not expensive.

Now, iterate through the list and run this query

SELECT blobCol
FROM my_table
where name >= <pageHeader>
and name < <nextPageHeader>
and city="<any string>"
and id= 1

This will run N times, where N = size of list obtained previously. Since 'name' is the primary key col, and 'city' is also indexed, EXPLAIN shows that this calculation is performed in memory using the index.

Now, each query takes 1 second to run, instead of the original 30-40. So combining the pre-processing time of 2 seconds per page, total time per page is 3-4 seconds instead of 30-40.

If anyone has a better solution or if there is something glaringly wrong with this one, please let me know
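
The two phases described above can be sketched as follows, in Python with the standard-library sqlite3 module standing in for MySQL (an assumption for runnability). The sketch uses a tiny table and a page size of 5, leaves out the id filter, and assumes 'name' is unique so that neighbouring page ranges never overlap.

```python
import sqlite3

# Assumption: SQLite in memory stands in for MySQL; 23 rows instead of millions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (name TEXT PRIMARY KEY, city TEXT, blobCol BLOB)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [("n%04d" % i, "otherKey", b"payload-%d" % i) for i in range(23)],
)

PAGE_SIZE = 5

# Phase 1: fetch only the first 'name' of every page until no row comes back.
headers = []
page = 0
while True:
    row = conn.execute(
        "SELECT name FROM my_table WHERE city = ? ORDER BY name LIMIT 1 OFFSET ?",
        ("otherKey", page * PAGE_SIZE),
    ).fetchone()
    if row is None:
        break
    headers.append(row[0])
    page += 1

# Phase 2: each pair of neighbouring headers becomes a range scan on the key.
pages = []
for i, start in enumerate(headers):
    if i + 1 < len(headers):
        rows = conn.execute(
            "SELECT blobCol FROM my_table "
            "WHERE name >= ? AND name < ? AND city = ? ORDER BY name",
            (start, headers[i + 1], "otherKey"),
        ).fetchall()
    else:
        # The last page has no upper bound.
        rows = conn.execute(
            "SELECT blobCol FROM my_table WHERE name >= ? AND city = ? ORDER BY name",
            (start, "otherKey"),
        ).fetchall()
    pages.append([r[0] for r in rows])
```

Phase 1 still pays for the growing offset, but it reads only the key column; the expensive blob reads in phase 2 all go through a bounded range on the indexed key.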

Are identical primary keys bad practice?

7 votes

I am trying to create a site where users can register and create a profile, therefore I am using two MySQL tables within a database e.g. users and user_profile.

The users table has an auto increment primary key called user_id.

The user_profile table has the same primary key called user_id however it is not auto increment.

*see note for why I have multiple tables.

When a user signs up, data from the registration form is inserted into users, then the last_insert_id() is input into the user_id field of the user_profile table. I use transactions to ensure this always happens.

My question is, is this bad practice?

Should I have a unique auto increment primary key for the user_profile table, even though one user can only ever have one profile?

Maybe there are other downsides to creating a database like this?

I'd appreciate it if anyone can explain why this is a problem or whether it's fine; I'd like to make sure my database is as efficient as possible.

Note: I am using separate tables for user and user_profile because user_profile contains fields that are potentially null and also will be requested much more than the user table, due to the data being displayed on a public profile.

Maybe this is also bad practice and they should be lumped in one table?

I find this a good approach. I'd give bonus points if you use a foreign key relation, preferably with cascade when deleting the user from the users table.

As for separating the core user data into one table and the optional profile data into another: good job. Nothing is more annoying than a 50-field dragonish entry with 90% empty values.
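
A sketch of the shared-primary-key layout with a cascading foreign key, in Python with the standard-library sqlite3 module standing in for MySQL (an assumption; in MySQL the tables would need an engine that enforces foreign keys, such as InnoDB). The username and bio columns are hypothetical.

```python
import sqlite3

# Assumption: SQLite in memory stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE users (
    user_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    username TEXT NOT NULL
);
CREATE TABLE user_profile (
    -- Same key as users: primary key and foreign key at once, no auto increment.
    user_id INTEGER PRIMARY KEY REFERENCES users(user_id) ON DELETE CASCADE,
    bio     TEXT
);
""")

# Both inserts run in one transaction, as described in the question.
with conn:
    cur = conn.execute("INSERT INTO users (username) VALUES (?)", ("alice",))
    new_id = cur.lastrowid
    conn.execute("INSERT INTO user_profile (user_id, bio) VALUES (?, ?)", (new_id, None))

profiles_before = conn.execute("SELECT COUNT(*) FROM user_profile").fetchone()[0]

# Deleting the user removes the profile row through the cascade.
with conn:
    conn.execute("DELETE FROM users WHERE user_id = ?", (new_id,))

profiles_after = conn.execute("SELECT COUNT(*) FROM user_profile").fetchone()[0]
```

Because user_profile.user_id is itself the primary key, the schema guarantees at most one profile per user without a second surrogate key.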

Pros/cons of MongoDB or MySQL for this purpose

7 votes

I'm looking for a bit of help or guidance on which database to use for a project. If you can raise any points, or note flaws, answer any questions or promote either database type for the purpose I'm about to spell out, I would really appreciate it.

Anyways:

  • We have some software that tracks forms.

  • We have users that can have MANY different properties, literally hundreds of settings, and I'm not a fan of MySQL tables that wide. I really like Mongo for this.

  • We have different types of forms, each can have completely different fields. Right now, we have a list of forms with generic data, then join the relevant table for additional data. I would have all of these fields in one distinct document with Mongo, and I could easily add fields without worrying.

  • We have fees, notes, history on each form. I like how in MySQL they are in a different table, and I can get history by form or by user - same as notes.

  • Our policy is pretty much keep ALL data, even deleted or pre-edited data... forever. Should I be worried about hitting a size limit? We're probably talking 100gb by the end of 2013

  • How many Mongo queries per page will bog things down? 20? 100? Would that change if I had a SSD in the server? (Right now, we have about 60 MySQL queries a page. This can be improved on.)

  • Is it a bad idea for my first Mongo project to be a somewhat major bit of software? Is it something I can learn as I go?

  • I like the case insensitivity of MySQL column names for quick and dirty things.

  • In MySQL, I break things out to different tables. Is it fine, in Mongo, to put data together that CAN be separated? Example: username, email, phone, license1 => [num,isValid], license2 => [num, isValid], notifications => [notification1...notification50000], password hash, salt, setting1, setting2...setting1000, permission1, permission2...permission1000 Of course, I'd make use of the nested style to organize, but is it best to store all this under "user" or break it out to settings, licenses, permissions? Second example: formName, address, notes => [note1 => [user,note,date], note2 => [user,note,date]]

  • Are there any problems with doing a HYBRID setup, where user data is in Mongo, and form data is in MySQL?

  • We have to run a lot of reports, are there limitations on this in Mongo? For example, would I run into problems looking for every form from the past 40 days with a fee over $10, with the fees in each row totaled up, sorted by the age of the user who filled it out?

  • Data redundancy - On the Amazon cloud, MySQL has MASSIVE amounts of redundancy. Is there any service to match that with Mongo? Is it complex to get into setting that up on my own?

  • Is MongoDB supported by any "cloud" providers? AWS does a lot for MySQL, but it looks like I'd be on my own for Mongo

Just a few things off the top of my head - I really do appreciate anything anyone has to say.

We have users that can have MANY different properties, literally hundreds of settings, and I'm not a fan of MySQL tables that wide. I really like Mongo for this.

We have different types of forms, each can have completely different fields. Right now, we have a list of forms with generic data, then join the relevant table for additional data. I would have all of these fields in one distinct document with Mongo, and I could easily add fields without worrying.

From your post I understand that your ultimate aim is to handle users and forms that have a varying schema (aka schemaless). I believe MongoDB is the right choice for this purpose.

We have fees, notes, history on each form. I like how in MySQL they are in a different table, and I can get history by form or by user - same as notes.

No problem. You can use separate documents, or embedded documents depending on their size (16 MB is the maximum size of a document), to handle this. So you can have a schema like

  Form
   - form field1
   - form field2
   - id of the fees doc
   - id of the notes doc
   - id of the history doc

or (for embedded docs)

  Form
   - form field1
   - form field2
   - embedded fees doc
             - fees field1 
             - fees field2
   - embedded notes doc
             - notes field1 
             - notes field2

Our policy is pretty much keep ALL data, even deleted or pre-edited data... forever. Should I be worried about hitting a size limit? We're probably talking 100gb by the end of 2013

You can store as much data as you need; there are already production deployments storing terabytes of data.

Is it a bad idea for my first Mongo project to be a somewhat major bit of software? Is it something I can learn as I go?

Yes, if you are going to use MongoDB without prototyping your application model. I would recommend implementing (prototyping) a minimal subset of your app (like the features that suck in MySQL) to learn the basics and see how comfortable you are.

I like the case insensitivity of MySQL column names for quick and dirty things.

Mongo enforces case sensitivity, because that's the nature of BSON (as well as JSON) key-value pairs.

In MySQL, I break things out to different tables. Is it fine, in Mongo, to put data together that CAN be separated? Example: username, email, phone, license1 => [num,isValid],

The main advantage of Mongo over SQL data stores is that you can store as much relevant info as you like within the same document (within the 16 MB limit). If you are unsure about the size, or certain parts of the data keep growing, you can split that part into another document. Since you are concerned about the number of queries, this will drastically reduce the number of requests.

Are there any problems with doing a HYBRID setup, where user data is in Mongo, and form data is in MySQL?

No, absolutely not. In fact, I am currently running MongoDB alongside MySQL (for transactions alone). But if you are not handling any transactions, you can stick with MongoDB.

We have to run a lot of reports, are there limitations on this in Mongo? For example, would I run into problems looking for every form from the past 40 days with a fee over $10, with the fees in each row totaled up, sorted by the age of the user who filled it out?

No, I don't see any limitation in this. In fact, it's very fast at handling queries with the proper indexes. But there are certain things you can't do with Mongo, like normal joins; instead, you can use map/reduce to process the data for reports.

Is MongoDB supported by any "cloud" providers? AWS does a lot for MySQL, but it looks like I'd be on my own for Mongo

MongoHQ and MongoLab are some of the dedicated managed Mongo hosting services available. Red Hat OpenShift and VMware Cloud Foundry also provide hosting platforms for Mongo; you can check out the Mongo hosting center for more info.

Hope this helps

Cheers

Why does removing this index in MySQL speed up my query 100x?

7 votes

I have the following MySQL table (simplified):

CREATE TABLE `track` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title` varchar(256) NOT NULL,
  `is_active` tinyint(1) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `is_active` (`is_active`, `id`)
) ENGINE=MyISAM AUTO_INCREMENT=7495088 DEFAULT CHARSET=utf8

The 'is_active' column marks rows that I want to ignore in most, but not all, of my queries. I have some queries that read chunks out of this table periodically. One of them looks like this:

SELECT id,title from track where (track.is_active=1 and track.id > 5580702) ORDER BY id ASC LIMIT 10;

This query takes over a minute to execute. Here's the execution plan:

> EXPLAIN SELECT id,title from track where (track.is_active=1 and track.id > 5580702) ORDER BY id ASC LIMIT 10;
+----+-------------+-------+------+-------------------+-----------+---------+-------+---------+-------------+
| id | select_type | table | type | possible_keys     | key       | key_len | ref   | rows    | Extra       |
+----+-------------+-------+------+-------------------+-----------+---------+-------+---------+-------------+
|  1 | SIMPLE      | t     | ref  | PRIMARY,is_active | is_active | 1       | const | 3747543 | Using where |
+----+-------------+-------+------+-------------------+-----------+---------+-------+---------+-------------+

Now, if I tell MySQL to ignore the 'is_active' index, the query happens instantaneously.

> EXPLAIN SELECT id,title from track IGNORE INDEX(is_active) WHERE (track.is_active=1 AND track.id > 5580702) ORDER BY id ASC LIMIT 10;
+----+-------------+-------+-------+---------------+---------+---------+------+---------+-------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref  | rows    | Extra       |
+----+-------------+-------+-------+---------------+---------+---------+------+---------+-------------+
|  1 | SIMPLE      | t     | range | PRIMARY       | PRIMARY | 4       | NULL | 1597518 | Using where |
+----+-------------+-------+-------+---------------+---------+---------+------+---------+-------------+

Now, what's really strange is that if I FORCE MySQL to use the 'is_active' index, the query once again happens instantaneously!

> EXPLAIN SELECT id,title from track FORCE INDEX(is_active) WHERE (track.is_active=1 AND track.id > 5580702) ORDER BY id ASC LIMIT 10;
+----+-------------+-------+-------+---------------+-----------+---------+------+---------+-------------+
| id | select_type | table | type  | possible_keys | key       | key_len | ref  | rows    | Extra       |
+----+-------------+-------+-------+---------------+-----------+---------+------+---------+-------------+
|  1 | SIMPLE      | t     | range | is_active     | is_active | 5       | NULL | 1866730 | Using where |
+----+-------------+-------+-------+---------------+-----------+---------+------+---------+-------------+

I just don't understand this behavior. In the 'is_active' index, rows should be sorted by is_active, followed by id. I use both the 'is_active' and 'id' columns in my query, so it seems like it should only need to do a few hops around the tree to find the IDs, then use those IDs to retrieve the titles from the table.

What's going on?

EDIT: More info on what I'm doing:

  • Query cache is disabled
  • Running OPTIMIZE TABLE and ANALYZE TABLE had no effect
  • 6,620,372 rows have 'is_active' set to True. 874,714 rows have 'is_active' set to False.
  • Using FORCE INDEX(is_active) once again speeds up the query.
  • MySQL version 5.1.54

It looks like MySQL is making a poor decision about how to use the index.

From that query plan, it is showing it could have used either the PRIMARY or is_active index, and it has chosen is_active in order to narrow by track.is_active first. However, it is only using the first column of the index (track.is_active). That gets it 3747543 results which then have to be filtered and sorted.

If it had chosen the PRIMARY index, it would be able to narrow down to 1597518 rows using the index, and they would be retrieved in order of track.id already, which should require no further sorting. That would be faster.

New information:

In the third case, where you are using FORCE INDEX, MySQL is using the is_active index, but now instead of only using the first column it is using both columns (see key_len: 5 bytes, i.e. 1 for the tinyint is_active plus 4 for the int id, versus 1 in the first plan). It is therefore able to narrow by is_active and filter by id using the same index. Since is_active is a single constant, the ORDER BY is also satisfied by the second column (i.e. the rows from a single branch of the index are already in sorted order). This seems to be an even better outcome than using PRIMARY, and probably what you intended in the first place, right?

I don't know why it wasn't using both columns of this index without FORCE INDEX, unless the query has changed in a subtle way in between. If not, I'd put it down to MySQL making bad decisions.
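To illustrate the difference between using one key part and both key parts of the (is_active, id) index, here is a rough Python sketch. It models the index as a sorted list of tuples and uses `bisect` for the seek; the 1000 made-up rows and the function names are assumptions for illustration, and this is not how MyISAM stores its B-tree.

```python
import bisect

# A composite index on (is_active, id) behaves like a list of tuples kept in
# sorted order. The rows are made up: every 8th id is inactive.
index = sorted(
    (1 if row_id % 8 else 0, row_id)
    for row_id in range(1, 1001)
)


def seek_both_columns(index, is_active, min_id, limit):
    """Use both key parts: jump straight to the first (is_active, id > min_id)."""
    start = bisect.bisect_right(index, (is_active, min_id))
    ids = []
    for active, row_id in index[start:start + limit]:
        if active != is_active:   # walked off the is_active branch
            break
        ids.append(row_id)
    return ids


def seek_first_column_only(index, is_active, min_id, limit):
    """Use only the first key part: visit every matching entry, filter on id."""
    start = bisect.bisect_left(index, (is_active,))
    examined, ids = 0, []
    for active, row_id in index[start:]:
        if active != is_active:
            break
        examined += 1
        if row_id > min_id:
            ids.append(row_id)
    # Return the requested ids plus how many index entries were examined.
    return ids[:limit], examined


if __name__ == "__main__":
    print(seek_both_columns(index, 1, 500, 5))
    print(seek_first_column_only(index, 1, 500, 5))
```

Both functions return the same ids, but the second one has to examine every active entry first, which is roughly what the slow plan with key_len 1 is doing on millions of rows.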

MySQL - Views - Super slow query

6 votes

This is a weird one. I am trying to use views in MySQL (I'm reasonably new to MySQL, with more experience with Sybase and SQL Server). Anyway, for this new project we are using MySQL, as it seems to have good performance. However, to make querying for a web front end simpler, we decided to create a few views. They all work, but they take forever to run.

The views are very simple, just select statements (these tables do have a few million rows in them). Say for example this query:

SELECT CAST(classifier_results.msgDate as DATE) AS mdate
       ,classifier_results.objClass AS objClass
       ,COUNT(classifier_results.objClass) AS obj
       ,classifier_results.subjClass AS subjClass
       ,COUNT(classifier_results.subjClass) AS subj 
FROM classifier_results 
WHERE (classifier_results.msgDate >= (curdate() - 20)) 
GROUP BY 
  CAST(classifier_results.msgDate as DATE)
  ,classifier_results.objClass
  ,classifier_results.subjClass 
ORDER BY classifier_results.msgDate DESC

When run as a normal select, it takes around 1.5 seconds to return a result.

However when this query is put into a view (as is) - i.e.

CREATE VIEW  V1a_sentiment_AI_current AS    
SELECT CAST(classifier_results.msgDate as DATE) AS mdate
       ,classifier_results.objClass AS objClass
       ,COUNT(classifier_results.objClass) AS obj
       ,classifier_results.subjClass AS subjClass
       ,COUNT(classifier_results.subjClass) AS subj 
FROM classifier_results 
WHERE (classifier_results.msgDate >= (curdate() - 20)) 
GROUP BY 
  CAST(classifier_results.msgDate as DATE)
  ,classifier_results.objClass
  ,classifier_results.subjClass 
ORDER BY classifier_results.msgDate DESC

The query takes about 10 times longer (22-30 seconds). So I'm thinking maybe there is some optimization or query caching that doesn't work with views, or maybe there is some setting we've missed in the MySQL config. Is there any way to speed up this view so it's just a nice placeholder for this query?

Running EXPLAIN on the two queries: The normal select gives:

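+----+-------------+--------------------+------+---------------+-----+---------+-----+--------+----------------------------------------------+
| id | select_type | table              | type | possible_keys | key | key_len | ref | rows   | Extra                                        |
+----+-------------+--------------------+------+---------------+-----+---------+-----+--------+----------------------------------------------+
|  1 | SIMPLE      | classifier_results | ALL  | idx_date      |     |         |     | 594845 | Using where; Using temporary; Using filesort |
+----+-------------+--------------------+------+---------------+-----+---------+-----+--------+----------------------------------------------+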

The view select gives:

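+----+-------------+--------------------+------+---------------+-----+---------+-----+--------+----------------------------------------------+
| id | select_type | table              | type | possible_keys | key | key_len | ref | rows   | Extra                                        |
+----+-------------+--------------------+------+---------------+-----+---------+-----+--------+----------------------------------------------+
|  1 | PRIMARY     | <derived2>         | ALL  |               |     |         |     | 100    |                                              |
|  2 | DERIVED     | classifier_results | ALL  | idx_date      |     |         |     | 594845 | Using where; Using temporary; Using filesort |
+----+-------------+--------------------+------+---------------+-----+---------+-----+--------+----------------------------------------------+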

Try re-creating your view using this:

CREATE ALGORITHM = MERGE VIEW `V1a_sentiment_AI_current` AS    
SELECT CAST(classifier_results.msgDate as DATE) AS mdate
   ,classifier_results.objClass AS objClass
   ,COUNT(classifier_results.objClass) AS obj
   ,classifier_results.subjClass AS subjClass
   ,COUNT(classifier_results.subjClass) AS subj 
FROM classifier_results 
WHERE (classifier_results.msgDate >= (curdate() - 20)) 
GROUP BY 
  CAST(classifier_results.msgDate as DATE)
  ,classifier_results.objClass
  ,classifier_results.subjClass 
ORDER BY classifier_results.msgDate DESC

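More information on MySQL's view processing algorithms can be found here. Note that, according to the MySQL documentation, MERGE cannot be used for views that contain aggregate functions or GROUP BY, so MySQL may still fall back to a temporary table for this particular view.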

How to join multiple tables related by other tables

5 votes

I'm developing a site where people can advertise their houses for rent. I'm using PHP 5.2.0 and MySQL 5+.

The publications are stored in a table like this:

ta_publications
+---+-------------+------+
|id |    name     | date |
+---+-------------+------+
| 1 | name_001    |  ... |
| 2 | name_002    |  ... |
| 3 | name_003    |  ... |
+---+-------------+------+

I have different publications, which have "features" such as "internet", "maid service", "satellite tv", etc.

These features might change in the future, and I want to be able to add/remove/modify them, so I store them in the database in a table.

ta_features
+---+-------------+
|id | name        |
+---+-------------+
| 1 | Internet    |
| 2 | Wi-Fi       |
| 3 | satelital tv|
+---+-------------+

These are related to the publications using the following table:

ta_publication_features
+---+-------------+----------------+
|id |   type_id   | publication_id |
+---+-------------+----------------+
| 1 |      1      |       1        |
| 2 |      2      |       1        |
| 3 |      3      |       1        |
+---+-------------+----------------+

I think it's pretty easy to understand: there is a publication called name_001 which has Internet, Wi-Fi and satellite TV.

I have the same schema for the images. I store them in this table:

ta_images
+---+-------------+
|id | src         |
+---+-------------+
| 1 | URL_1       |
| 2 | URL_2       |
| 3 | URL_3       |
+---+-------------+

And use the following table to relate them to the publications

ta_publication_images
+---+-------------+----------------+----------+
|id |  img_id     | publication_id |   order  |
+---+-------------+----------------+----------+
| 1 |      1      |       1        |    0     |
| 2 |      2      |       1        |    1     |
| 3 |      3      |       1        |    2     |
+---+-------------+----------------+----------+

The order column gives the order in which images should be displayed when listing a single publication.

Philipp Reichart provided me with a query that will search and get all the publications that have certain features. It works for listing the publications, but I can't modify it to return the data I need.

So I figured I'll run that query to get all of the publications that pass the search criteria and then use another query to list them.

The listing of these publications should include all publication data (everything in ta_publications) + all of its features + the most important (order 0) image src.

I could, for every publication, run two simple queries which return, separately, the most important image and all of its features. But when listing 25 publications per page, that would be 1 search query + (2 queries per publication * 25 publications) = 51 different queries, which is clearly not very efficient.

EDIT:

My question is: how can I create a SQL query that, given some publication ids, will return all publication data (everything in ta_publications) + all of its features + the most important (order 0) image src?

You'll get redundant publication and image data with this one, but here is a way to do it with one query:

    SELECT p.id, p.name, p.date,
           f.id AS feature_id, f.name AS feature_name,
           i.id AS image_id, i.src
    FROM ta_publications p
    JOIN ta_publication_features pf ON p.id = pf.publication_id
    JOIN ta_features f ON f.id = pf.type_id
    JOIN ta_publication_images pi ON p.id = pi.publication_id
         AND pi.`order` = 0  -- only the most important image
    JOIN ta_images i ON i.id = pi.img_id
    WHERE p.id IN (
        /* list of publication ids */
    );
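Since the single query repeats the publication and image data once per feature, the rows need to be grouped again in application code. The question uses PHP, but as a runnable sketch here is the idea in Python, with an in-memory SQLite database standing in for MySQL and the sample rows from the question. The function name `list_publications`, the added ORDER BY and the shape of the returned dictionary are assumptions for illustration.

```python
import sqlite3

# In-memory SQLite stands in for MySQL, purely so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ta_publications (id INTEGER PRIMARY KEY, name TEXT, date TEXT);
CREATE TABLE ta_features (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE ta_publication_features (
    id INTEGER PRIMARY KEY, type_id INTEGER, publication_id INTEGER);
CREATE TABLE ta_images (id INTEGER PRIMARY KEY, src TEXT);
CREATE TABLE ta_publication_images (
    id INTEGER PRIMARY KEY, img_id INTEGER, publication_id INTEGER,
    "order" INTEGER);
INSERT INTO ta_publications VALUES
    (1, 'name_001', NULL), (2, 'name_002', NULL), (3, 'name_003', NULL);
INSERT INTO ta_features VALUES (1, 'Internet'), (2, 'Wi-Fi'), (3, 'satelital tv');
INSERT INTO ta_publication_features VALUES (1, 1, 1), (2, 2, 1), (3, 3, 1);
INSERT INTO ta_images VALUES (1, 'URL_1'), (2, 'URL_2'), (3, 'URL_3');
INSERT INTO ta_publication_images VALUES (1, 1, 1, 0), (2, 2, 1, 1), (3, 3, 1, 2);
""")

QUERY = """
SELECT p.id, p.name, f.name, i.src
FROM ta_publications p
JOIN ta_publication_features pf ON p.id = pf.publication_id
JOIN ta_features f ON f.id = pf.type_id
JOIN ta_publication_images pi ON p.id = pi.publication_id AND pi."order" = 0
JOIN ta_images i ON i.id = pi.img_id
WHERE p.id IN ({placeholders})
ORDER BY p.id, f.id
"""


def list_publications(conn, publication_ids):
    """Run the single joined query and fold the repeated rows per publication."""
    if not publication_ids:
        return {}
    placeholders = ", ".join("?" for _ in publication_ids)
    rows = conn.execute(QUERY.format(placeholders=placeholders),
                        list(publication_ids))
    publications = {}
    for pub_id, name, feature, image_src in rows:
        pub = publications.setdefault(
            pub_id, {"name": name, "image": image_src, "features": []})
        pub["features"].append(feature)
    return publications


if __name__ == "__main__":
    print(list_publications(conn, [1, 2]))
```

Because these are inner joins, a publication with no features or no order 0 image drops out of the result entirely; LEFT JOINs would keep such publications in the listing.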

Odd behaviour when doing LIKE with wildcards searching for backslash in MySQL

5 votes

I've encountered a very unusual problem with MySQL, involving backslashes. Basically when I do a wildcard match with LIKE for \n which is in the database as text rather than an actual newline, it will only return a match if I have just a right hand wildcard:

SELECT * 
FROM  `tmptest` 
WHERE  `a` LIKE  '\\\\n%'

Now, if I query like this, it will not return anything:

SELECT *  
FROM `tmptest` 
WHERE `a` LIKE '%\\\\n%'

As you can see from the data I have in the table both queries should be matching. I'm not sure if it's something that I'm missing, or I'm incorrectly escaping the newline but it doesn't make sense for the first query to work and the second not.

Table structure:

CREATE TABLE IF NOT EXISTS `tmptest` (
`a` varchar(22) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

Sample data:

INSERT INTO `tmptest` (`a`) VALUES
('\\n'),
('\\n\\ndfsdfsdfs\\n');

Thanks for taking time to read this.

It works for me with 6 backslashes when using the left side wildcard:

mysql> SELECT * FROM `tmptest` WHERE `a` LIKE '%\\\\\\n%';
+-----------------+
| a               |
+-----------------+
| \n              |
| \n\ndfsdfsdfs\n |
+-----------------+
2 rows in set (0.00 sec)

Using mysqld Ver 5.1.49


@Karolis As far as I understand, the expression for the LIKE operator is parsed twice, hence \\\\ turns into \ when used with LIKE.

But how to explain this (using the expression "back\slash"):

SELECT 'back\\slash' LIKE '%back\\\\slash%'; -> TRUE (normal behaviour)

SELECT 'back\\slash' LIKE '%back\\\\\slash%'; -> TRUE (5 backslashes?)

SELECT 'back\\slash' LIKE '%back\\\\\\slash%'; -> TRUE (6 backslashes?)

SELECT 'back\\slash' LIKE '%back\\\\\\\slash%'; -> TRUE (7 backslashes?)

SELECT 'back\\slash' LIKE '%back\\\\\\\\slash%'; -> FALSE (normal behaviour, I guess..)

Nevertheless if searching only for the "\":

mysql> SELECT 'back\\slash' LIKE '%\\\\%'; --> FALSE (but should work)

mysql> SELECT 'back\\slash' LIKE '%\\\\\%'; --> TRUE (5 backslashes)

mysql> SELECT 'back\\slash' LIKE '%\\\\\\%'; --> TRUE (6 backslashes)

mysql> SELECT 'back\\slash' LIKE '%\\\\\\\%'; --> FALSE (7 backslashes)

For this particular question, one could use a different escape character | and bypass the problem altogether (if no | character occurs):

mysql> SELECT 'back\\slash' LIKE '%\\%' ESCAPE '|'; --> TRUE

So maybe some MySQL guru out there can explain this. I simply can't. I also tested with MySQL 5.1.53 on a different machine and observed the same behaviour. As I said in my original comment, it's a rather interesting question.
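For what it's worth, the "parsed twice" rule mentioned above can be sketched in Python. This is a simplified model of the documented behaviour only (string literal escapes first, then LIKE escapes), not MySQL's actual implementation. It ignores collations and case-insensitivity, and the function names are made up for illustration.

```python
import re

# Escapes that map to control characters in a quoted string (simplified subset).
SPECIAL = {"n": "\n", "t": "\t", "r": "\r", "0": "\0"}


def parse_string_literal(body):
    """Pass 1: resolve backslash escapes inside a quoted SQL string."""
    out, i = [], 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            nxt = body[i + 1]
            if nxt in "%_":
                out.append("\\" + nxt)   # kept intact for the LIKE pass
            else:
                out.append(SPECIAL.get(nxt, nxt))   # unknown escape: drop the backslash
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def like_to_regex(pattern, escape="\\"):
    """Pass 2: interpret the escape character, % and _ as LIKE does."""
    parts, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == escape and i + 1 < len(pattern):
            parts.append(re.escape(pattern[i + 1]))   # escaped char is literal
            i += 2
        elif ch == "%":
            parts.append(".*")
            i += 1
        elif ch == "_":
            parts.append(".")
            i += 1
        else:
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts), re.DOTALL)


def sql_like(value_literal, pattern_literal, escape="\\"):
    """Evaluate 'value' LIKE 'pattern' with both arguments given as typed in SQL."""
    value = parse_string_literal(value_literal)
    pattern = parse_string_literal(pattern_literal)
    return like_to_regex(pattern, escape).fullmatch(value) is not None


if __name__ == "__main__":
    for n in range(4, 9):
        pattern = "%back" + "\\" * n + "slash%"
        print(n, sql_like(r"back\\slash", pattern))
```

Under this model, 4 to 7 backslashes before "slash" all collapse to a single literal backslash (the odd one out pairs with the "s" and is dropped), and 8 collapse to two, which agrees with the five "back\slash" results listed above. It does not reproduce the results for the bare '%\\\\%' patterns or for the original '%\\\\n%' query, so those remain unexplained here.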