Best MySQL questions in March 2011

Why are composite primary keys still around?

48 votes

I'm assigned to migrate a database to a mid-class ERP. The new system uses composite primary keys here and there, and from a pragmatic point of view, why?

Compared to autogenerated IDs, I can only see negative aspects:

  • Foreign keys become blurry
  • Harder migration or db-redesigns
  • Inflexible as the business changes. (My car has no reg.plate..)
  • Same integrity better achieved with constraints.

It's falling back to the design concept of candidate keys, which I don't see the point of either.

Is it a habit/artifact from the floppy-days (minimizing space/indexes), or am I missing something?

//edit// Just found good SO-post: Composite primary keys versus unique object ID field //

Personally I prefer surrogate keys. However, in join tables that consist only of the IDs from two other tables (to create a many-to-many relationship), composite keys are the way to go, so removing them would make things more difficult.
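As a minimal sketch of the join-table case, the following uses Python's built-in sqlite3 module as a stand-in for MySQL. The students, courses and enrollments tables and the enroll helper are hypothetical names chosen for illustration. It shows that the composite primary key by itself rejects a duplicate pairing, with no surrogate ID needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE courses  (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    -- Join table: the pair of foreign keys is itself the primary key.
    CREATE TABLE enrollments (
        student_id INTEGER NOT NULL REFERENCES students(id),
        course_id  INTEGER NOT NULL REFERENCES courses(id),
        PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO students VALUES (1, 'Ann');
    INSERT INTO courses  VALUES (10, 'SQL');
    INSERT INTO enrollments VALUES (1, 10);
""")

def enroll(student_id, course_id):
    """Return True if the row was inserted, False if the pair already existed."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO enrollments VALUES (?, ?)",
                         (student_id, course_id))
        return True
    except sqlite3.IntegrityError:
        return False

duplicate_accepted = enroll(1, 10)  # same pair as the seeded row
row_count = conn.execute("SELECT COUNT(*) FROM enrollments").fetchone()[0]
```

The same uniqueness could be had with a surrogate ID plus a separate UNIQUE constraint on the pair; the composite key simply avoids the extra column and index.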

There is a school of thought that surrogate keys are always bad and that if you can't guarantee the uniqueness of a record through natural keys you have a bad design. I strongly disagree with this (if you aren't storing SSN or some other unique value I defy you to come up with a natural key for a person table, for instance). But many people feel that it is necessary for proper normalization.

Sometimes having a composite key reduces the need to join to another table. Sometimes it doesn't. So there are times when a composite key can boost performance as well as times when it can harm performance. If the key is relatively stable, you may be fine with faster performance on select queries. However, if it is something that is subject to change, like a company name, you could be in a world of hurt when company A changes its name and you have to update a million associated records.

There is no one size fits all in database design. There are times when composite keys are helpful and times when they are horrible. There are times when surrogate keys are helpful and times when they are not.

Create a summary row for data across multiple tables

9 votes

I'm trying to write a SQL query to generate a summary row for the actions performed by a given user in a given period. I have the following relevant table structure:

users

  • id
  • team

audit_periods (can be processing, shipping, break, etc)

  • user_id
  • period_type (can be "processing", "shipping", etc -- not currently normalized)
  • started_at
  • finished_at (can be null for the current period, hence the logic around times below)

audit_tasks

  • audit_period_id
  • audit_task_type_id
  • created_at
  • score

audit_task_types

  • name ("scan", "place_in_pallet", etc)
  • score (seems redundant, but we need to maintain the score that the audit_task received at the time it was performed, as the audit_task_type score can change later)

For each user for a given period, I'd like to create something like the following row of data:

users.id users.email time_spent_processing time_spent_shipping ... number_of_scans number_of_pallets

which would be calculated by figuring out for each user:

  • What audit_periods fall at least partially in the desired window? (Uses started_at and finished_at.)
  • How long did a user spend in each type of audit_period? (Should involve group by audit_periods.period_type, I'd imagine.)
  • What audit_tasks fall within the desired window? (Uses created_at -- not in the code below yet.)
  • How many of each type of audit_task did a user accomplish during the window? (Joins out to audit_task_type, and likely involves a group by on audit_task_types.name.)
  • How many points were earned during the time period? (Sums the scores of all the audit_tasks in the window.)

I've exhausted all of the SQL tricks I know (not many) and came up with something like the following:

select 
    u.id as user_id,
    u.email as email,
    u.team as team,
    ap.period_type as period_type,
    att.name,
    time_to_sec(
      timediff(least("2011-03-17 00:00:00", ifnull(ap.finished_at, utc_timestamp())), greatest("2011-03-16 00:00:00", ap.started_at))
    ) as period_duration,
    sum(at.score) as period_score
  from audit_periods as ap
  inner join users as u on ap.user_id = u.id
  left join audit_tasks as at on at.audit_period_id = ap.id
  left join audit_task_types as att on at.audit_task_type_id = att.id
  where (ap.started_at >= "2011-03-16 00:00:00" or (ap.finished_at >= "2011-03-17 00:00:00" and ap.finished_at <= "2011-03-17 00:00:00"))
    and (ap.finished_at <= "2011-03-17 00:00:00" or (ap.started_at >= "2011-03-16 00:00:00" and ap.started_at <= "2011-03-16 00:00:00"))
    and u.team in ("Foo", "Bar")
  group by u.id, ap.id, at.id

but this seems to be functionally equivalent to just selecting all of the audit tasks in the end. I've tried some subqueries as well, but to little avail. More directly, this generates something like (skipping less important columns):

user_id   |   period_type   |   period_duration  |  name            |   score
1             processing        1800s               scan                200
1             shipping          1000s               place_in_pallet     100
1             shipping          1000s               place_in_pallet     100
1             break             500s                null                null

when I want:

user_id   |   processing    |   shipping  |  break  |  scan  |  place_in_pallet  |  score
1             1800s             1000s        500s      1        2                   400

I can easily fetch all of the audit_tasks for a given user and roll them up in code, but I might be fetching hundreds of thousands of audit_tasks over a given period, so it needs to be done in SQL.

Just to be clear -- I'm looking for a query to generate one row per user, containing summary data collected across the other 3 tables. So, for each user, I want to know how much time he spent in each type of audit_period (3600 seconds processing, 3200 seconds shipping, etc), as well as how many of each audit_task he performed (5 scans, 10 items placed in pallet, etc).

I think I have the elements of a solution, I'm just having trouble piecing them together. I know exactly how I would accomplish this in Ruby/Java/etc, but I don't think I understand SQL well enough to know which tool I'm missing. Do I need a temp table? A union? Some other construct entirely?

Any help is greatly appreciated, and I can clarify if the above is complete nonsense.

You will need to break this up into two crosstab queries, one that gives you the audit_periods information by user and another that gives you the audit_task information by user, and then join both to the users table. It isn't clear how you want to roll up the information in each case. For example, if a given user has 10 audit_period rows, how should the query roll up those durations? I assumed a sum of the durations here but you might want a min or max or perhaps even an overall delta.

Select U.id As user_id
    , AuditPeriodByUser.TotalDuration_Processing As processing
    , AuditPeriodByUser.TotalDuration_Shipping As shipping
    , AuditPeriodByUser.TotalDuration_Break As break
    , AuditTasksByUser.TotalCount_Scan As scan
    , AuditTasksByUser.TotalCount_Place_In_Pallet As place_in_pallet
    , AuditTasksByUser.TotalScore As score
From users As U
    Left Join   (
                Select AP.user_id
                    , Sum( Case When AP.period_type = 'processing' 
                                Then Time_To_Sec( 
                                        TimeDiff( 
                                            Coalesce(AP.finished_at, UTC_TIMESTAMP()), AP.started_at ) ) 
                                Else 0 End ) 
                        As TotalDuration_Processing
                    , Sum( Case When AP.period_type = 'shipping' 
                                Then Time_To_Sec( 
                                        TimeDiff( 
                                            Coalesce(AP.finished_at, UTC_TIMESTAMP()), AP.started_at ) ) 
                                Else 0 End ) 
                        As TotalDuration_Shipping
                    , Sum( Case When AP.period_type = 'break' 
                                Then Time_To_Sec( 
                                        TimeDiff( 
                                            Coalesce(AP.finished_at, UTC_TIMESTAMP()), AP.started_at ) ) 
                                Else 0 End ) 
                        As TotalDuration_Break
                From audit_periods As AP
                Where AP.started_at >= @StartDate 
                    And AP.finished_at <= @EndDate
                Group by AP.user_id
                ) As AuditPeriodByUser
            On AuditPeriodByUser.user_id = U.id
    Left Join   (
                Select AP.user_id
                    , Sum( Case When ATT.name = 'scan' Then 1 Else 0 End ) As TotalCount_Scan
                    , Sum( Case When ATT.name = 'place_in_pallet' Then 1 Else 0 End ) As TotalCount_Place_In_Pallet
                    , Sum( AT.score ) As TotalScore
                From audit_tasks As AT
                    Join audit_task_types As ATT
                        On ATT.id = AT.audit_task_type_id
                    Join audit_periods As AP
                        On AT.audit_period_id = AP.id
                Where AP.started_at >= @StartDate 
                    And AP.finished_at <= @EndDate
                Group By AP.user_id
                ) As AuditTasksByUser
        On AuditTasksByUser.user_id = U.id
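The two-subquery shape of the query above can be tried end to end with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, so durations are computed with SQLite's strftime('%s', ...) rather than Time_To_Sec/TimeDiff. The sample rows are invented to match the example output in the question, and the column aliases (processing_secs and so on) are hypothetical. For brevity it only keeps periods that lie entirely inside the window, rather than clipping them with least/greatest.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, team TEXT);
    CREATE TABLE audit_periods (id INTEGER PRIMARY KEY, user_id INTEGER,
        period_type TEXT, started_at TEXT, finished_at TEXT);
    CREATE TABLE audit_task_types (id INTEGER PRIMARY KEY, name TEXT, score INTEGER);
    CREATE TABLE audit_tasks (id INTEGER PRIMARY KEY, audit_period_id INTEGER,
        audit_task_type_id INTEGER, created_at TEXT, score INTEGER);

    INSERT INTO users VALUES (1, 'Foo');
    INSERT INTO audit_periods VALUES
        (1, 1, 'processing', '2011-03-16 09:00:00', '2011-03-16 09:30:00'),
        (2, 1, 'shipping',   '2011-03-16 09:30:00', '2011-03-16 09:46:40'),
        (3, 1, 'break',      '2011-03-16 09:46:40', '2011-03-16 09:55:00');
    INSERT INTO audit_task_types VALUES (1, 'scan', 200), (2, 'place_in_pallet', 100);
    INSERT INTO audit_tasks VALUES
        (1, 1, 1, '2011-03-16 09:10:00', 200),
        (2, 2, 2, '2011-03-16 09:35:00', 100),
        (3, 2, 2, '2011-03-16 09:40:00', 100);
""")

SUMMARY_SQL = """
SELECT u.id,
       p.processing_secs, p.shipping_secs, p.break_secs,
       k.scan_count, k.pallet_count, k.total_score
FROM users AS u
LEFT JOIN (
    -- One row per user: seconds spent in each period type.
    SELECT d.user_id,
           SUM(CASE WHEN d.period_type = 'processing' THEN d.secs ELSE 0 END) AS processing_secs,
           SUM(CASE WHEN d.period_type = 'shipping'   THEN d.secs ELSE 0 END) AS shipping_secs,
           SUM(CASE WHEN d.period_type = 'break'      THEN d.secs ELSE 0 END) AS break_secs
    FROM (
        SELECT user_id, period_type,
               CAST(strftime('%s', finished_at) AS INTEGER)
             - CAST(strftime('%s', started_at)  AS INTEGER) AS secs
        FROM audit_periods
        WHERE started_at >= :window_start AND finished_at <= :window_end
    ) AS d
    GROUP BY d.user_id
) AS p ON p.user_id = u.id
LEFT JOIN (
    -- One row per user: task counts per type and the total score.
    SELECT ap.user_id,
           SUM(CASE WHEN tt.name = 'scan'            THEN 1 ELSE 0 END) AS scan_count,
           SUM(CASE WHEN tt.name = 'place_in_pallet' THEN 1 ELSE 0 END) AS pallet_count,
           SUM(t.score) AS total_score
    FROM audit_tasks AS t
    JOIN audit_task_types AS tt ON tt.id = t.audit_task_type_id
    JOIN audit_periods    AS ap ON ap.id = t.audit_period_id
    WHERE t.created_at >= :window_start AND t.created_at < :window_end
    GROUP BY ap.user_id
) AS k ON k.user_id = u.id
"""

window = {"window_start": "2011-03-16 00:00:00", "window_end": "2011-03-17 00:00:00"}
summary = conn.execute(SUMMARY_SQL, window).fetchone()
```

Aggregating periods and tasks in separate subqueries matters: joining audit_periods directly to audit_tasks before summing would count the 1000-second shipping period once per task performed in it.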

RESTful MySQL / Terminology / Passing Parameters / Returning Ints & Doubles

8 votes

So, in an attempt to create a RESTful frontend to a MySQL database, I've briefly looked at phprestql (easy & simple, but just too simple) and now I'm attempting to build it onto NetBeans' tutorial. I've got the basic tutorial completed and working with my database just fine. However, I'm trying to figure out how to customize it a bit.

  1. All the results in JSON seem to be strings, even though in the MySQL table properties are Big Ints, Ints, and Doubles. The types also seem to be set correctly within the netbeans sources as well. However, JSON returns everything as strings. Any ideas where to address this? (Again, I'm just working from the tutorial above, albeit with my DB.)

  2. I'm also trying to figure out how I can implement additional parameters in the URI, to further refine the DB results. (http://localhost/the_db/people_table/?gender_property=male&updated_property=2011-01-18) ... would return all people rows that fit those criteria. Part of my problem is I'm not even sure of the proper terminology for this kind of feature, so it's making it a little difficult to find examples and tutorials on it.

  3. This may be related to the previous item, but I'd also like to use the URI to "drill-down" into the table/row/property to return individual values (in JSON) ... (http://localhost/the_db/people_table/42/lastname) ... would return {"Jones"}

Part of the problem is that I barely know Java from Ruby from Python. I'm pretty familiar with breaking things in Objective-C, PHP, and Perl though. However, tutorials for quick and easy Restful MySQL services with those don't seem very popular or prevalent.

[EDIT]

To the extent that this helps answer question #1, I'm attaching some of the java methods to indicate how the numerical properties are set/retrieved ... from what I can tell the actual JSON generation is automated by some library. I don't see it in here:

/** in the MySQL CommitteeObj table, the committeeId is set as follows */
/* `committeeId` bigint(11) NOT NULL auto_increment */

/** in committee.java */
public class committee implements Serializable {
    private static final long serialVersionUID = 1L;
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    @Basic(optional = false)
    @Column(name = "committeeId")
    private BigInteger committeeId;
//.....
}

public committee(BigInteger committeeId) {
    this.committeeId = committeeId;
}

@Override
    public String toString() {
        return "texlege.committee[committeeId=" + committeeId + "]";
    }

/** in committeeConverter.java */

@XmlElement
public BigInteger getCommitteeId() {
    return (expandLevel > 0) ? entity.getCommitteeId() : null;
}

public void setCommitteeId(BigInteger value) {
    entity.setCommitteeId(value);
}

/** in committeeResource.java */

@GET
    @Produces({"application/json"})
    public committeeConverter get(@QueryParam("expandLevel")
                                  @DefaultValue("1")
    int expandLevel) {
        return new committeeConverter(getEntity(), uriInfo.getAbsolutePath(), expandLevel);
    }

protected committee getEntity() {
        try {
            return (committee) em.createQuery("SELECT e FROM committee e where e.committeeId = :committeeId").setParameter("committeeId", id).getSingleResult();
        } catch (NoResultException ex) {
            throw new WebApplicationException(new Throwable("Resource for " + uriInfo.getAbsolutePath() + " does not exist."), 404);
        }
    }
}

And here's the output from a query of a specific committee. Notice the distinct lack of JSON numbers for the committeeId, committeeType, and parentId properties:

{
    "@uri":"http://localhost:8080/TexLegeRest/rest/committees/2735/",
    "clerk":"Amy Peterson",
    "committeeId":"2735",
    "committeeName":"Appropriations",
    "committeeType":"1",
    "parentId":"-1",
    "updated":"2011-02-20T00:00:00-06:00",
}

In short, this answer may not be what you are looking for at all as it's nothing to do with NetBeans. However it does provide a different way of doing what you want in providing a RESTful interface to a MySQL database.

I have uploaded a zip file with 4 Java files, 2 XML files and 1 text file to support this solution, otherwise the answer would have been very long.

In short this is a Maven/Java/Spring/Hibernate/MySQL solution, the reason being that I have been using this architecture recently and found it quite simple and powerful for what is really just converting SQL ↔ JSON!

This solution also uses a few other tools like Maven for compiling/packaging/deploying rather than an IDE, which in my opinion removes a level of complexity, but might put a few IDE-loving people off.

System Configuration

So firstly you will need to download and unzip/install Java and Maven if you don't have those already. I'll also assume Windows, mainly because that is what I am currently using. I have installed these applications in the following locations:

c:\apps\java\jdk1.6.0_24
c:\apps\apache-maven-3.0.3

Since there is no IDE in this solution, the application is built and run from the command line. There is a tiny amount of configuration here, so just execute the following to set up some environment variables:

set JAVA_HOME=c:\apps\java\jdk1.6.0_24

set M2_HOME=c:\apps\apache-maven-3.0.3

set PATH=%PATH%;%M2_HOME%\bin;%JAVA_HOME%\bin

Typing mvn --version can then be used to verify that Java and Maven are installed and found correctly.

Project Creation

Create a directory for your source, let's use c:\src\project1

On the command line again, navigate to that directory and execute:

mvn archetype:generate -DgroupId=my.group -DartifactId=project1 -DarchetypeArtifactId=maven-archetype-quickstart

Maven will download some standard libraries and eventually prompt you to "Define value for property 'version':" - just press Enter to continue. Maven will then ask you to confirm the project settings, so just hit Enter again to confirm. You will end up with a directory structure in which you will find a pom.xml file and two Java files. The Project Object Model (POM) file tells Maven how to build/test/package/deploy (and more) your project. You need to add some libraries to that file so that we can use Spring, JSON, Jetty and other functionality. So edit the pom.xml, adding the following to the XML structure:

Under <project> element (i.e. as a sibling of the <url> element) add:

<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <spring.version>3.0.5.RELEASE</spring.version>
</properties>
<repositories>
  <repository>
    <id>JBoss</id>
    <url>https://repository.jboss.org/nexus/content/groups/public/</url>
    <releases>
      <enabled>true</enabled>
      <updatePolicy>always</updatePolicy>
      <checksumPolicy>warn</checksumPolicy>
    </releases>
  </repository>
</repositories>

Under the <dependencies> element add the contents of the dependencies.txt file from the zip file linked above. Those changes will allow Maven to find the latest Hibernate and Spring libraries, which are not always present in the default Maven repositories, as well as other libraries such as HSQLDB (an in-memory database used to test this example) and a library for JSON ↔ Java conversion.

Also under the <project> element (I added this just after the </dependencies> element) add the following:

<build>
  <plugins>
    <plugin>
      <groupId>org.mortbay.jetty</groupId>
      <artifactId>maven-jetty-plugin</artifactId>
      <version>6.1.26</version>
      <configuration>
        <contextPath>/${project.artifactId}</contextPath>
        <scanIntervalSeconds>10</scanIntervalSeconds>
        <webXml>${project.build.directory}/${project.build.finalName}/WEB-INF/web.xml</webXml>
      </configuration>
    </plugin>
  </plugins>
</build>

This is the embedded web server that we will use to run the .war file you are about to build, which leads us to the final change in the pom.xml… the <packaging> element near the top of the file needs to be changed to war instead of jar.

On the command line again, navigate to the project directory you just created where the pom.xml is (probably cd project1) and type mvn compile. Maven should download all the new libraries we just added in the POM and hopefully compile without error. Now we just need to configure Spring to wire up all the RESTful URLs & beans and configure the web application itself.

Create 2 new directories under /src/main called resources and webapp/WEB-INF. Your directory structure should now look like this:

src
src/main
src/main/java
src/main/resources
src/main/webapp/WEB-INF

In the resources directory add the file called applicationContext.xml from the zip file. The ApplicationContext is the configuration for the application.

In the WEB-INF directory add the file called web.xml from the zip file. The web.xml describes how a web container (e.g. Tomcat or in our case Jetty) should deploy the application.

Now we need to add in some code! Instead of adding the code here and making this answer longer than it already is, the zip file contains 4 classes. Simply copy those into the src/main/java/my/group directory, overwriting App.java in the process.

Compilation and Execution

This is where you should cross your fingers… as you should be able to use mvn compile to compile the classes and then if successful mvn jetty:run-war to run the web server with the application war file. If there are no errors in starting the application, there should be some logging that looks like INFO: Mapped URL path [/people] onto handler 'app' as the initialization of Jetty finishes.

Testing the REST Interface

Now we can test the RESTful URLs. I recommend using the Poster addon for Firefox (not compatible with Firefox 4 though), so install this and we can use it to do PUT and GET requests on the project1 web-app. Once installed either select Tools → Poster or Ctrl+Alt+P.

Firstly, since we are using Spring content negotiation (scroll down to the Content Negotiation section), you will need to configure Poster to send the correct Content Type. Just add application/json to this field. To add a person to our database, just add

{"firstName" : "foo", "lastName" : "bar"}

to the body (this is the large area in the Poster addon) and use the PUT button. You should get a response back from the web-app and see logging in the command window. The response should be:

{"name":"foo bar","id":1,"height":1.8}

This is valid JSON and you can see integers and doubles are appearing just fine. If you have a look at the Person.java class from the zip file, you can see that firstName and lastName are the names of the actual class members which match the names of the JSON keys that were PUT. I have added a @JsonIgnore annotation to those and created a different @JsonProperty to return the full name instead. In practice you probably would not do this, because it would make it difficult to update just the first or last name, but in this example I am just using it to show that you have full control of the JSON entities returned and their names/values. Also note the Person class has a hard-coded Double (the height member) to demonstrate that numbers are serialized correctly to JSON.
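The general principle, that a serializer emits JSON numbers only when it is handed numeric values, can be illustrated independently of Spring with a short Python sketch. The raw_row dictionary and the COLUMN_TYPES mapping below are hypothetical, built from the field names in the question's output. This is not a diagnosis of the NetBeans-generated code, only a demonstration of the difference between the two outputs.

```python
import json

# Hypothetical row in which every column value has arrived as text.
raw_row = {
    "committeeId": "2735",
    "committeeType": "1",
    "parentId": "-1",
    "committeeName": "Appropriations",
}

# Assumed column-to-type mapping; a real service would derive this from
# table metadata or from the entity class.
COLUMN_TYPES = {
    "committeeId": int,
    "committeeType": int,
    "parentId": int,
    "committeeName": str,
}

def typed(row):
    """Convert each text value to the Python type declared for its column."""
    return {column: COLUMN_TYPES[column](value) for column, value in row.items()}

as_strings = json.dumps(raw_row)         # numbers stay quoted
as_numbers = json.dumps(typed(raw_row))  # numbers are emitted bare
```

If the generated converter hands string values to its JSON library, the fix would belong at that conversion step rather than in the database.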

You can then retrieve person 1 by changing the URL to http://localhost:8080/project1/people/1 and using the GET button instead, which just returns the same JSON.

Swapping HSQLDB for MySQL

You might have noticed that there is no MySQL database so far. You will need to change some of the configuration to point to a MySQL database instead of the in-memory HSQL database that was used up until now. The "dataSource" bean properties should be updated like so:

<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close"
  p:driverClassName="com.mysql.jdbc.Driver"
  p:url="jdbc:mysql://localhost:3306/name_of_your_database_instance"
  p:username="your_database_username"
  p:password="your_database_password"/>

where you need to specify the connection details of your database.

Lastly the hibernate dialect needs updating to be the MySQL one, so replace the org.hibernate.dialect.HSQLDialect with

<prop key="hibernate.dialect">org.hibernate.dialect.MySQL5InnoDBDialect</prop>

in the applicationContext.xml file.

Be warned that when Hibernate is configured with the property <prop key="hibernate.hbm2ddl.auto">create</prop>, it will drop and recreate the mapped tables, destroying their existing data, every time the web application is started, since Hibernate scans the Person class and creates the table from the @Table, @Entity and @Column (and other) annotations.

Please note that unfortunately I've not tested this last part as I don't have MySQL installed but hopefully it should work.

Answering your questions (or not)

1) No idea, sorry. It looks like just a conversion/serialization problem which could probably be solved looking at what beans are used in the tutorial and checking their documentation.

2) and 3) With the solution above you can add as many path variables as you require, e.g.

@RequestMapping(value = "/people/gender/{gender}/updated/{lastUpdated}", method = RequestMethod.GET)
@ResponseBody
public Person findByGenderAndUpdated(@PathVariable String gender, @PathVariable String lastUpdated) {}

It might not be practical to create an API that exposes the individual properties. Returning full resources from your URLs is more practical; just let the downstream component pick out the lastName from a Person JSON object if that is all it needs. However, I could see the need for a cut-down JSON representation of a Person if there is a lot of data, i.e. it is more efficient from a bandwidth perspective to exclude certain large data properties.

You would have to implement these methods yourself for each combination. This effectively constitutes your RESTful API. If you need to create a document to describe your API then the Atlassian guidelines are very well written.
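As an alternative to writing one method per combination, the query-string style from the question can be mapped onto a parameterized WHERE clause. The sketch below is in Python rather than Spring, and the ALLOWED_FILTERS whitelist, the people table and the column names are all hypothetical. It only builds the SQL text and the bound values, and leaves execution to whatever database layer is in use.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical whitelist: query-parameter name -> column name.
# Only whitelisted names ever reach the SQL text; values are always bound.
ALLOWED_FILTERS = {
    "gender_property": "gender",
    "updated_property": "updated",
}

def build_people_query(url):
    """Turn the URL's query string into (sql, values) for a parameterized query."""
    clauses, values = [], []
    for name, value in parse_qsl(urlsplit(url).query):
        column = ALLOWED_FILTERS.get(name)
        if column is None:
            continue  # ignore unknown parameters rather than trusting them
        clauses.append(f"{column} = ?")
        values.append(value)
    sql = "SELECT * FROM people"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, values

sql, values = build_people_query(
    "http://localhost/the_db/people_table/?gender_property=male&updated_property=2011-01-18"
)
```

Keeping column names in a whitelist and passing the values as bound parameters avoids building SQL directly from user input.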

Summary

There are many variations on this solution, and in practice you should put the classes in a better directory structure (model, view, controller) and create some Java class templates for handling the persistence, since all models will probably need "save" and "find" methods, for example.

I hope this solution is useful to someone :-)

create mention like twitter or convore with php

7 votes

Hello, I'm just curious about how they do this. I assume they do something like this:

@someone1 im stacking on stackoverflow RT @someone2 : hello guys what are you doing?

Before I describe my approach, here is my database schema:

// CID = COMMENT ID, BID  = BLOG ID, UID = USER ID
CID    BID   UID    COMMENT
1       1     1      @someone1 im stacking on stackoverflow RT @someone2 : ....
2       1     4      @someone1 im stacking on stackoverflow RT @someone2 : ....
3       1     12     @someone1 im stacking on stackoverflow RT @someone2 : ....

  1. They use a regex like this to extract the @someone names:

    preg_match_all("/@[a-zA-Z0-9_]+/", $text, $matches);

  2. Then they strip the @ from each name:

    $usernames = array();
    // $matches[0] holds the full-pattern matches
    foreach ($matches[0] as $mention) {
        $usernames[] = substr($mention, 1);
    }

  3. Then they look up the UID for each name in the database, something like this:

    foreach ($usernames as $username) {
        # insert database one by one ? so it will be like the example above
    }

Then we can just output the comment by getting the UID.

Then somehow we can get all the comments on the blog (without duplicate comments) where bid = 1, and give every mentioned user a notification using where uid = :uid.

Is there a better way of doing this? Something like Twitter or Convore?

Thanks for looking in

Adam Ramadhan

I have done something similar to this with an in-house application that we use for communication.

Basically, you are going to have two tables: status_messages and mentions. Each status message can have many mentions. Whenever someone creates a status update, you save it to the status_messages table. During this process, you can also use a regex to detect any @username "mentions". When you find a mention, you add it to your mentions table. For example, your mentions table might look something like this:

 mention_id (Auto-incrementing key) | status_message_id | username_id

That way, if you want to see whether someone is mentioned in a status message, you can do a quick lookup in the mentions table, as opposed to loading up the status message and running the regex each time. The other nice thing about this approach is that it allows you to have multiple mentions in each status message. Just create a record in mentions for each.

That's the basic way that we have set it up.
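A minimal sketch of that setup, using Python and its built-in sqlite3 module in place of PHP and MySQL. The mentions columns follow the layout above; the users table, the post_status and mentions_feed helpers and the sample usernames are assumptions made for illustration.

```python
import re
import sqlite3

MENTION_RE = re.compile(r"@([A-Za-z0-9_]+)")  # captures the name without the @

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT UNIQUE NOT NULL);
    CREATE TABLE status_messages (
        id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, body TEXT NOT NULL);
    CREATE TABLE mentions (
        mention_id INTEGER PRIMARY KEY AUTOINCREMENT,
        status_message_id INTEGER NOT NULL REFERENCES status_messages(id),
        username_id INTEGER NOT NULL REFERENCES users(id));
    INSERT INTO users (id, username) VALUES (1, 'someone1'), (2, 'someone2'), (3, 'adam');
""")

def post_status(author_id, body):
    """Save the message, then one mentions row per distinct known @username."""
    with conn:  # one transaction for the message and its mentions
        cur = conn.execute(
            "INSERT INTO status_messages (user_id, body) VALUES (?, ?)",
            (author_id, body))
        message_id = cur.lastrowid
        for username in sorted(set(MENTION_RE.findall(body))):
            row = conn.execute(
                "SELECT id FROM users WHERE username = ?", (username,)).fetchone()
            if row is not None:  # skip names that are not registered users
                conn.execute(
                    "INSERT INTO mentions (status_message_id, username_id) VALUES (?, ?)",
                    (message_id, row[0]))
    return message_id

def mentions_feed(user_id):
    """Bodies of the messages that mention the given user, oldest first."""
    rows = conn.execute(
        "SELECT s.body FROM mentions AS m "
        "JOIN status_messages AS s ON s.id = m.status_message_id "
        "WHERE m.username_id = ? ORDER BY s.id", (user_id,))
    return [body for (body,) in rows]

TEXT = "@someone1 im stacking on stackoverflow RT @someone2 : hello guys what are you doing?"
post_status(3, TEXT)
post_status(3, "no mentions here")
feed_for_someone2 = mentions_feed(2)
mention_rows = conn.execute("SELECT COUNT(*) FROM mentions").fetchone()[0]
```

The user ID is passed as a bound parameter rather than interpolated into the SQL string, which is worth doing in the PHP version as well.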

EDIT: If you wanted to pull an "activity feed" for a given user, showing only the status updates in which they have been mentioned, it would be as simple as:

SELECT * FROM mentions m LEFT JOIN status_messages s ON m.status_message_id = s.id WHERE m.username_id = $username_id

I should note that this is not how they do it at Twitter, because they are dealing with issues of scale that would make this simple way of doing things impossible. However, I think this is the simplest solution that works well if you aren't worried about scaling to hundreds of thousands of users. If you are, then you probably have more issues on your hands than this.

What type of Join to use?

7 votes

I've got a core table and 3 tables that extend the 'core' table in different ways.

I'm working with MLS data and I have a 'common' table that contains information common to all MLS listings, and then a table that has specifically "residential" information, one for "commercial", etc. I have been using the MLS number to join to a single table when the property type of a listing is known, but for searching I want to join all of them and have the special fields available as search criteria (not simply searching the common table).

What type of join will give me a dataset that will contain all listings (including the extended fields in the idx tables) ?

For each Common table record there is a single corresponding record in ONLY ONE of the idx tables.

                     ___________
                    |           |
                    |  COMMON   |
                    |           |
                    |___________|
                         _|_   
                          |
       ___________________|_____________________
     _|_                 _|_                   _|_
 _____|_____         _____|______           ____|______ 
|           |       |            |         |           |
|   IDX1    |       |   IDX2     |         |   IDX3    |
|           |       |            |         |           |
|___________|       |____________|         |___________|

If you want everything in one row, you can use something like this format. Basically it gives you all the "Common" fields, then the other fields if there is a match otherwise NULL:

SELECT  Common.*,
        Idx1.*,
        Idx2.*,
        Idx3.*
FROM Common
LEFT JOIN Idx1
    ON Idx1.MLSKey = Common.MLSKey
LEFT JOIN Idx2
    ON Idx2.MLSKey = Common.MLSKey  
LEFT JOIN Idx3
    ON Idx3.MLSKey = Common.MLSKey

Bear in mind it's better to list out fields than to use the SELECT * whenever possible...

Also I'm assuming MySQL syntax is the same as SQL Server, which is what I use.
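The LEFT JOIN pattern above can be checked with a small sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL. The price, bedrooms and zoning columns and the sample listings are hypothetical, and only two of the three idx tables are included to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Common (MLSKey INTEGER PRIMARY KEY, price INTEGER);
    CREATE TABLE Idx1 (MLSKey INTEGER PRIMARY KEY, bedrooms INTEGER);  -- e.g. residential
    CREATE TABLE Idx2 (MLSKey INTEGER PRIMARY KEY, zoning TEXT);       -- e.g. commercial
    INSERT INTO Common VALUES (100, 250000), (200, 900000);
    INSERT INTO Idx1 VALUES (100, 3);
    INSERT INTO Idx2 VALUES (200, 'C-2');
""")

JOINED = """
    SELECT c.MLSKey, c.price, r.bedrooms, k.zoning
    FROM Common AS c
    LEFT JOIN Idx1 AS r ON r.MLSKey = c.MLSKey
    LEFT JOIN Idx2 AS k ON k.MLSKey = c.MLSKey
"""

# Every listing appears once; columns from the non-matching table are NULL (None).
all_listings = conn.execute(JOINED + " ORDER BY c.MLSKey").fetchall()

# Extended fields can be used directly as search criteria.
three_bed_keys = [key for (key, *_rest) in
                  conn.execute(JOINED + " WHERE r.bedrooms >= ?", (3,))]
```

Because each listing has a row in only one idx table, the joins never multiply rows; a filter on an extended field simply excludes listings of the other types, since NULL never satisfies the comparison.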

How does Amazon RDS backup/snapshot actually work?

7 votes

I am an Amazon RDS customer and am experiencing daily Amazon RDS write latency spikes, corresponding roughly to the backup window. I also see spikes at the end of a snapshot (case in point: running a snapshot takes approximately 1 hour, and in the final 5 minutes, write latency spikes). I am running a multi-AZ m1.large deployment.

Is there anyone on Stack who can explain how Amazon RDS backup is actually working? I've read the Amazon RDS docs, and as far as I can tell, Amazon RDS is not behaving according to spec. Specifically, these backup/snapshot operations should be hitting my replica, and therefore not causing any downtime/performance hit, or so I thought.

I can distill my problem into six questions:

  • What is technically happening during a snapshot and a backup, and how are they different? (If you answer this question, please tell me if you are able to empirically confirm your answer, or are simply quoting me documentation).
  • Is a spike in write latency to be expected during the backup window on a multi-AZ deployment?
  • Is a spike in write latency to be expected at the end of a snapshot on a multi-AZ deployment?
  • Would my write latency spike be even higher if I was not multi-AZ ?
  • Architecturally, would I be able to avoid these write latency spikes if I rolled my own database running on two m1.large EC2 instances?
  • Are there any configurations I can use that would avoid these write latency spikes while still hosting my DB with RDS, or am I effectively at the mercy of Amazon?

Bonus Question: where and how do you host your mysql database?

I can say that I have been generally happy with RDS except for these daily write latency issues. I love the built-in database monitoring and it was fairly simple to setup and get going.

Thanks!

We also run several RDS instances, in addition to MySQL on some machines that we manage ourselves. I can't comment specifically, as I'm not an Amazon engineer, but several things I've learned that might explain what you're seeing:

  • Although Amazon does not share the backend details 100%, we strongly suspect that they are using their EBS system to back RDS databases.

  • This article helps explain EBS limitations and snapshot functionality http://blog.rightscale.com/2008/08/20/amazon-ebs-explained/ Again, while it's not explicit, it would make sense for Amazon to be using this infrastructure to provide RDS services.

  • Typically, a MySQL backup, in contrast to a snapshot, involves using a tool like mysqldump to create a file of SQL statements that will then reproduce the database. The database does not need to be frozen to do this. With an EBS backend, the best practice is to freeze the database (pause all transactions) while you are snapshotting to avoid data corruption.

  • The spikes you're seeing at the ends of the backup window. If replication is paused by Amazon during the snapshot of your replica, the replica would then need to "catch up" on transactions when the snapshot was complete. This would cause a latency spike.

  • Replication across a multi-AZ deployment is inherently slower than in a single-AZ deployment. That is the price you pay for better redundancy.

Alphabetically ordering records with "The", "A", "An" etc at the beginning of varchar field

7 votes

I'm looking for both MySQL and PostgreSQL solutions for this kind of problem.

Say I have a number of records with a title field. The titles are book or movie titles, like "The Cat in the Hat" and "Robin Hood". But while the titles must be displayed in their original form, they ought to be sorted in the way that libraries sort them, which is by moving any article, like "The" or "An" to the end of the title.

So "The Cat in the Hat" is sorted as if it were "Cat in the Hat, The".

What's the best way either to design the schema or write the query so that these records are sorted by title in the same way that libraries sort the title? (I also wish I knew the technical term for this type of ordering by title.) Also, what performance considerations should I be aware of and what indexes should I create?

Create a custom function (sortableTitle, perhaps?) that will modify strings starting with your unwanted words. Finish your query statement with order by sortableTitle(title). This will incur an extra CPU cost, though you'll have to benchmark to know how much.

You could create an extra column (sortTitle) that is populated by a trigger. This will take up some space, but then your server will be able to sort rows by an index.

Excepting the above, you cannot (without modifying the database server code) directly create an index that is in the order you want. As far as I can tell, that applies to both MySQL and PostgreSQL.
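
As a rough illustration of what such a sortableTitle function (or the trigger filling a sortTitle column) would compute, here is a minimal Python sketch. The article list and the function name sortable_title are assumptions made for the example; a real implementation would live in a stored function or trigger on the database side.

```python
import re

# Assumed list of leading articles; extend it for other languages as needed.
_LEADING_ARTICLE = re.compile(r"^(the|an|a)\s+(.+)$", re.IGNORECASE)

def sortable_title(title: str) -> str:
    """Move a leading article to the end of the title.

    'The Cat in the Hat' becomes 'Cat in the Hat, The';
    titles without a leading article are returned unchanged.
    """
    title = title.strip()
    match = _LEADING_ARTICLE.match(title)
    if not match:
        return title
    article, rest = match.group(1), match.group(2)
    return f"{rest}, {article}"

titles = ["The Cat in the Hat", "Robin Hood", "An American Tail", "A Bug's Life"]

# Display the original title, but sort by the derived key (case-insensitive).
ordered = sorted(titles, key=lambda t: sortable_title(t).lower())
```

Storing the derived value in its own indexed column, as suggested above, means this work is done once per write rather than once per row on every sorted query.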

Multithreaded Delphi database application failing with large amounts of data

6 votes

Overview of the application:
I have a Delphi application that allows a user to define a number of queries, and run them concurrently over multiple MySQL databases. There is a limit on the number of threads that can be run at once (which the user can set). The user selects the queries to run, and the systems to run the queries on. Each thread runs the specified query on the specified system using a TADOQuery component.

Description of the problem:
When the queries retrieve a low number of records, the application works fine, even when lots of threads (up to about 100) are submitted. The application can also handle larger numbers of records (150,000+) as long as only a few threads (up to about 8) are running at once. However, when the user is running more than around 10 queries at once (i.e. 10+ threads), and each thread is retrieving 150,000+ records, we start getting errors. Here are the specific error messages that we have encountered so far:

a: Not enough storage is available to complete this operation
b: OLE error 80040E05
c: Unspecified error
d: Thread creation error: Not enough storage is available to process this command
e: Object was open
f: ODBC Driver does not support the requested properties

Evidently, the errors are due to a combination of factors: number of threads, amount of data retrieved per thread, and possibly the MySQL server configuration.

The main question really is why are the errors occurring? I appreciate that it appears to be in some way related to resources, but given the different errors that are being returned, I'd like to get my head around exactly why the errors are cropping up. Is it down to resources on the PC, or something to do with the configuration of the server, for example.

The follow-up question is what can we do to avoid getting the problems? We're currently throttling down the application by lowering the number of threads that can be run concurrently. We can't force the user to retrieve fewer records, as the queries are totally user defined and if they want to retrieve 200,000 records, then that's up to them, so there's not much that we can do about that side of things. Realistically, we don't want to throttle down the speed of the application, because most users will be retrieving small amounts of data and we don't want to make the application too slow for them to use. Although the number of threads can be changed by the user, we'd rather get to the root of the problem and try to fix it without having to rely on tweaking the configuration all the time.

It looks like you're loading a lot of data client-side. The rows may need to be cached in client memory (especially if you use bidirectional cursors), and in a 32-bit application there may not be enough memory for that, depending on the average row size and how efficiently the library stores rows. Usually the best way to do database work is to perform it directly on the server, without retrieving the data to the client. Databases usually have an efficient cache system and can write data out to disk when it doesn't fit in memory. Why do you retrieve 150,000 rows at once? You could use a mechanism that transfers data only when the user actually accesses it (a sort of paging through the data), to avoid large chunks of "wasted" memory.
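
The paging idea can be sketched as follows. This is an illustration in Python rather than Delphi, and it uses an in-memory SQLite database as a stand-in for MySQL; the table name records and the helper iter_rows are made up for the example. Whether a given driver still buffers the whole result set on the client depends on the driver and cursor type, so treat this as the shape of the approach, not a guarantee.

```python
import sqlite3

def iter_rows(cursor, batch_size=1000):
    """Yield rows one at a time while only holding one batch in memory."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield from batch

# Stand-in data: 2,500 small rows in an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO records (payload) VALUES (?)",
    [(f"row {i}",) for i in range(2500)],
)

cursor = conn.execute("SELECT id, payload FROM records ORDER BY id")

# Process rows as they arrive instead of materialising the full list.
total = sum(1 for _ in iter_rows(cursor, batch_size=500))
```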

mysql_real_escape_string shortcut

6 votes

I have an array with around 10 key values. What is the best way to apply mysql_real_escape_string to all of them?

$escaped_array=array_map('mysql_real_escape_string',$array);

Look at array_map

MySQL in star topology

6 votes

I have one central database with all the data in MySQL 5.1-latest-stable.
I want to hook up multiple clients in a master-master relationship.

Question

How do I set up a star topology with 1 central server in the middle and multiple client databases, so that changes in one client get propagated first to the central server and from there to all the other client databases?

Database info

I'm using InnoDB for all the tables and I've enabled the binary log.
Other than that I've learned how to do master-master between two databases.
All tables have an integer auto-increment primary key. The auto-increment offset and start are tuned differently on each client database so that primary keys never conflict.
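
For readers unfamiliar with this scheme: MySQL exposes it through the auto_increment_increment and auto_increment_offset system variables. The sketch below is a simplified Python model of why the generated keys stay disjoint; the server names, the increment of 10, and the helper generated_ids are assumptions for illustration only.

```python
def generated_ids(offset, increment, count):
    """Simplified model of the IDs one server hands out:
    offset, offset + increment, offset + 2 * increment, ..."""
    return [offset + k * increment for k in range(count)]

# Hypothetical star: one central server plus three laptops.
# The increment must be at least the number of servers that will ever write.
INCREMENT = 10
servers = {"central": 1, "laptop_a": 2, "laptop_b": 3, "laptop_c": 4}

ids = {name: generated_ids(offset, INCREMENT, 5) for name, offset in servers.items()}

# Flatten every server's IDs to check that no value is produced twice.
all_ids = [i for sequence in ids.values() for i in sequence]
```

Choosing an increment larger than the current number of clients leaves room to add laptops later without renumbering.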

Why do I want this

I have client software (not a website or PHP) that connects to a local MySQL database on the laptop. This needs to sync to a central database, so that everyone using the program on their laptop sees the changes that other people make.
I do not want to connect directly against the central database because if the internet connection drops between the laptop and the central database my application dies.
In this setup the application continues, the laptop just does not get updates from other people until the connection to the central database is reestablished.

Given the requirement to use MySQL Circular Replication against floating slaves as a means to Synchronize DBs, here is a solution:

1 DB Master
4 DB Slaves


SETUP OF DB MASTER

  1. Install MySQL 5.1.x
  2. Make sure /etc/my.cnf contains:
          server-id=1
          log-bin=mysql-bin
          expire-logs-days=14
          default-storage-engine=InnoDB
  3. Startup MySQL
  4. RESET MASTER; (Clear Binary Logs From DB Master)
  5. Load in Data into Master
  6. GRANT SELECT,REPLICATION SLAVE ON *.* TO replicator@'%' IDENTIFIED BY 'repl-password';

SETUP OF DB SLAVES

  1. Install MySQL 5.1.x
  2. Make sure /etc/my.cnf contains:
          server-id=(unique server id)
          log-bin=mysql-bin
          default-storage-engine=InnoDB
  3. Startup MySQL
  4. CHANGE MASTER TO MASTER_HOST='IP Address of DB Master', MASTER_PORT=3306, MASTER_USER='replicator', MASTER_PASSWORD='repl-password', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=106;
  5. START SLAVE; (let replication catch up, check SHOW SLAVE STATUS\G)
  6. STOP SLAVE;
  7. CHANGE MASTER TO MASTER_HOST='IP Address of DB Distribution Master', MASTER_PORT=3306, MASTER_USER='replicator', MASTER_PASSWORD='repl-password', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=106;
  8. START SLAVE;

IMPORT PROCESS

Once every DB Slave is prepared, now the migration of Data between the DB Master and DB Slave can proceed as follows (DM for DB Master, and DS for DB Slave):

  1. On DS, Run SHOW MASTER STATUS;
  2. On DS, Record binary log filename and position
  3. On DM, Run CHANGE MASTER TO MASTER_HOST='IP Address of DS', MASTER_PORT=3306, MASTER_USER='replicator', MASTER_PASSWORD='repl-password', MASTER_LOG_FILE='LogFile From Step2', MASTER_LOG_POS=(LogPos From Step2);
  4. On DM, Run START SLAVE; (let replication catch up; port changes introduced by DS to DM)
  5. On DM, STOP SLAVE;
  6. On DS, record line 2 of /var/lib/mysql/master.info (log file)
  7. On DS, record line 3 of /var/lib/mysql/master.info (log position)
  8. On DS, Run CHANGE MASTER TO MASTER_HOST='IP Address of DM', MASTER_PORT=3306, MASTER_USER='replicator', MASTER_PASSWORD='repl-password', MASTER_LOG_FILE='LogFile From Step6', MASTER_LOG_POS=(LogPos From Step7);
  9. On DS, Run START SLAVE; (let replication catch up; port changes introduced by DM to DS)
  10. On DS, STOP SLAVE;

CAVEAT

The import process is crucial !!! You must record the exact log file and log position correctly each and every time.
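
Because a mistyped log file or position breaks the whole procedure, it may help to script that bookkeeping. The following Python sketch reads the log file and position from lines 2 and 3 of master.info contents, as the import steps describe, and builds the matching CHANGE MASTER TO statement. The sample file contents, the IP address and both function names are invented for illustration, and the values are interpolated without escaping, so only feed it trusted input.

```python
def read_log_coordinates(master_info_text):
    """Return (log_file, log_pos) taken from lines 2 and 3 of master.info."""
    lines = master_info_text.splitlines()
    return lines[1].strip(), int(lines[2].strip())

def change_master_statement(host, log_file, log_pos,
                            user="replicator", password="repl-password",
                            port=3306):
    """Build the CHANGE MASTER TO statement used in the steps above.

    No escaping is performed; this is a sketch for trusted values only.
    """
    return (
        f"CHANGE MASTER TO MASTER_HOST='{host}', MASTER_PORT={port}, "
        f"MASTER_USER='{user}', MASTER_PASSWORD='{password}', "
        f"MASTER_LOG_FILE='{log_file}', MASTER_LOG_POS={log_pos};"
    )

# Made-up master.info contents: line 2 is the log file, line 3 the position.
sample = "15\nmysql-bin.000003\n4711\n10.0.0.5\nreplicator\n"

log_file, log_pos = read_log_coordinates(sample)
statement = change_master_statement("10.0.0.5", log_file, log_pos)
```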

Give it a Try !!!

Let me know how it goes !!!

what happens with old CMS/blog websites?

6 votes

I've created a couple of small websites, each a few pages long, for one-time projects or conferences, mostly in Wordpress, and I'm thinking about what will happen to those websites in the future. I think I'm not alone, as there are a large number of sites out there that are now only kept as an archive. But unlike in the 90s, when everything was static HTML, these websites now use some software to provide CMS functionality, even if it's only for a few pages plus search.

My problem is that with all this modular software (Wordpress, Joomla, etc.) you need to use various plugins and themes to make it usable and nice, but all this functionality breaks sooner or later. This means that if you want to keep the website as is, you need to stay on the old versions of the software. I mean forever.

On the other hand, they are so popular (Wordpress has more than 100 million downloads now) that I would be surprised if they did not become a target for the most popular exploits in the near future. I don't know how safe this software is, but I have experienced what it means to continuously keep cleaning and fixing an osCommerce website with about 7 successful hacker attacks every month, until the site's owner agreed that it was better to close the site entirely and start building a new one.

As an alternative solution (but I really don't know if it's possible), is there any way to put a whole site into a read-only mode? I mean something like making the database read-only, the file system read-only, disabling the admin interface and all the comment fields, and just leaving the site as an archive, the only dynamic part being the search function.

Is it possible at the file-system/database level? Will it help at all to keep hackers out? Is there any other solution? Please understand that my point is that it is not possible to keep CMS sites updated forever, and even if some of us are fanatic enough to spend a night fixing a theme/plugin that broke after a core upgrade, 99% of the sites will end up in a "fixed" state, using a working but old CMS/plugins/theme combination forever.

I think 99% is a very generous estimate, but that's beside the point. The majority of the sites that end up in the state you are referring to only last as long as their domain registrations (especially since most Wordpress or OSCommerce deployments are usually set up as the root domain and service the entirety of the web presence.) So generally speaking, if the domain itself is in a state of neglect and abandonment, the natural expiration process will decommission it and it will no longer be accessible in general.

As for locking down an entire, sitewide state on one of these CMS systems, it could in theory be possible if one removed all write privileges for all the server files and revoked every database user privilege except SELECT. In most cases this would defeat the purpose of leaving the CMS software there at all, since none of the records would be updatable any longer (items in the case of OSCommerce, posts in the case of Wordpress). But this would be highly dependent on the environment required by the particular CMS, and Wordpress for one is pretty particular about the read/write permissions it needs to work at all. It would make for an interesting experiment, but probably isn't a practical solution for what you're describing.

Taking the rendered content and building a static mirror is another option, and it can be pretty easily automated by writing a script that gets the HTML content of the rendered pages and builds static, linked alternatives. But this too is a bit impractical, especially in the case of a search (since this by its very definition requires database access).
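
To give an idea of what such a mirroring script involves, here is a minimal Python sketch of the crawling half: it follows same-host links from a start page and collects the rendered HTML of each page. The fetch callable is injected so the example can run against an in-memory fake site; the example.test URLs are placeholders, and writing the pages to disk and rewriting their links to relative paths is deliberately left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect the href values of all <a> tags in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def mirror(start_url, fetch, max_pages=1000):
    """Crawl pages on the same host as start_url and return {url: html}."""
    host = urlparse(start_url).netloc
    pages, queue = {}, [start_url]
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in pages:
            continue
        html = fetch(url)
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            target = urljoin(url, href).split("#")[0]  # drop fragments
            if urlparse(target).netloc == host and target not in pages:
                queue.append(target)
    return pages

# Fake two-page site standing in for real HTTP requests.
site = {
    "http://example.test/": '<a href="/about">About</a> <a href="http://other.test/">Elsewhere</a>',
    "http://example.test/about": '<a href="/">Home</a>',
}
pages = mirror("http://example.test/", site.__getitem__)
```

In real use, fetch would wrap an HTTP client, and off-the-shelf mirroring tools already cover this ground more robustly than a hand-written script.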

In short, it's an interesting idea, but ultimately sites that are neglected and whose owners are not committed to sustaining proper updates are doomed to expiration, and the natural course of Internet business and domain registration pretty often Darwinizes them.

Unique key on whole mysql table?

6 votes

Suppose I have a MySQL table with two columns: A and B. Is it possible to have a unique key so that I can insert a value only once in either A or B (once in the whole table)?

So if column A contains 'qwe' and B contains 'asd' then these two values cannot be inserted anymore in either column.

this will not work:

UNIQUE KEY `A` (`A`,`B`),
UNIQUE KEY `A_2` (`A`),
UNIQUE KEY `B` (`B`),
UNIQUE KEY `B_2` (`B`,`A`)

thanks.

edit: I was able to accomplish this with the following trigger:

delimiter |
create trigger unique_check before insert on mytable
for each row begin
    declare alreadyexists integer;
    select count(*) > 0 into alreadyexists from mytable
        where A=NEW.B or B=NEW.A;
    IF alreadyexists = 1 THEN begin
        DECLARE dummy INT;
        SELECT 'A OR B already exists' INTO dummy FROM mytable
            WHERE nonexistent = 'value';
    end;
    END IF;
END;|

However, I do not see the 'A OR B already exists' error message, but:

ERROR 1054 (42S22): Unknown column 'nonexistent' in 'where clause'

Thanks again!

Yes it's possible.

One way is to create a BEFORE INSERT trigger that returns an error if the value is already found in other columns/tables.

From this blog post

MySQL Triggers: How do you abort an INSERT, UPDATE or DELETE with a trigger? On EfNet’s #mysql someone asked:

How do I make a trigger abort the operation if my business rule fails?

In MySQL 5.0 and 5.1 you need to resort to some trickery to make a trigger fail and deliver a meaningful error message. The MySQL Stored Procedure FAQ says this about error handling:

SP 11. Do SPs have a “raise” statement to “raise application errors”? Sorry, not at present. The SQL standard SIGNAL and RESIGNAL statements are on the TODO.

Perhaps MySQL 5.2 will include SIGNAL statement which will make this hack stolen straight from MySQL Stored Procedure Programming obsolete. What is the hack? You’re going to force MySQL to attempt to use a column that does not exist. Ugly? Yes. Does it work? Sure.

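CREATE TRIGGER mytabletriggerexample
BEFORE INSERT ON mytable
FOR EACH ROW BEGIN
    DECLARE dummy INT;
    IF (NEW.important_value) < (fancy * dancy * calculation) THEN
        -- Selecting a column that does not exist makes the statement fail,
        -- and the backticked "column name" shows up in the error message.
        SELECT `Your meaningful error message goes here` INTO dummy
            FROM mytable
            WHERE mytable.id = NEW.id;
    END IF;
END;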

Another way

You can also do this with transactions.

Use a stored procedure with a transaction to insert data into a transactional (InnoDB) table.

In the trigger, set a flag on the error condition:

set @error=1; 

In the procedure something like this:

set @error=0; 
start transaction 
do insert 
if @error>0 then rollback; 
else commit; 
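
The same check-inside-a-transaction idea can also be driven from application code. The sketch below uses Python with an in-memory SQLite database as a stand-in for MySQL/InnoDB; the helper names insert_pair and try_insert are made up, and the rule that a value may not appear in both columns of the same row is an assumption about what "once in the whole table" means. Under concurrent writers on a real server, the check and the insert would additionally need locking to be safe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (A TEXT, B TEXT)")

def insert_pair(conn, a, b):
    """Insert (a, b) only if neither value already appears in column A or B."""
    if a == b:
        raise ValueError("A and B must differ")
    with conn:  # commits on success, rolls back if an exception escapes
        clash = conn.execute(
            "SELECT 1 FROM mytable WHERE A IN (?, ?) OR B IN (?, ?) LIMIT 1",
            (a, b, a, b),
        ).fetchone()
        if clash is not None:
            raise ValueError("A or B already exists")
        conn.execute("INSERT INTO mytable (A, B) VALUES (?, ?)", (a, b))

def try_insert(conn, a, b):
    """Return True if the row was inserted, False if it was rejected."""
    try:
        insert_pair(conn, a, b)
        return True
    except ValueError:
        return False

results = [
    try_insert(conn, "qwe", "asd"),  # first row, accepted
    try_insert(conn, "asd", "zxc"),  # 'asd' already used in column B
    try_insert(conn, "zxc", "qwe"),  # 'qwe' already used in column A
    try_insert(conn, "zxc", "rty"),  # neither value used yet, accepted
]
row_count = conn.execute("SELECT COUNT(*) FROM mytable").fetchone()[0]
```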

Which is better database design?

6 votes

Given a site like StackOverflow, would it be better to create num_comments column to store how many comments a submission has and then update it when a comment is made or just query the number of rows with the COUNT function? It seems like the latter would be more readable and elegant but the former would be more efficient. What does SO think?

Definitely use COUNT. Storing the number of comments is a classic de-normalization that produces headaches. It's slightly more efficient for retrieval but makes inserts much more expensive: each new comment requires not only an insert into the comments table, but also a write lock on the row containing the comment count.
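
A small runnable illustration of the COUNT approach follows, written in Python against an in-memory SQLite database as a stand-in for MySQL. The submissions and comments tables and their columns are hypothetical; the index on the foreign key is what keeps the count cheap as the comments table grows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE submissions (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comments (
        id INTEGER PRIMARY KEY,
        submission_id INTEGER,
        body TEXT
    );
    -- Index the foreign key so counting comments per submission stays cheap.
    CREATE INDEX idx_comments_submission ON comments (submission_id);
    INSERT INTO submissions (id, title)
        VALUES (1, 'First'), (2, 'Second'), (3, 'Third');
    INSERT INTO comments (submission_id, body)
        VALUES (1, 'a'), (1, 'b'), (3, 'c');
""")

# LEFT JOIN keeps submissions with zero comments; COUNT(c.id) ignores NULLs.
rows = conn.execute("""
    SELECT s.id, s.title, COUNT(c.id) AS num_comments
    FROM submissions s
    LEFT JOIN comments c ON c.submission_id = s.id
    GROUP BY s.id, s.title
    ORDER BY s.id
""").fetchall()
```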

How to determine if two users share some information without multiple queries

6 votes

Hi.

I'm trying to work out how to allow a user, [V], visiting another user's profile, [A], to see all the groups that user [A] is in and also which of the groups they're both part of.

The tables are:

USERS TABLE
user_id | name  | email....
1       | Drent | drents... [V]
2       | Dude2 | dude2@... [A]
3       | Dude3 | dude3@...

GROUPS TABLE
group_id | group_name   | joining_policy
1        | The Crazies  | invite_only
2        | Team OSM     | open
3        | My Group     | approval_needed

GROUP_USERS TABLE
group_id | user_id
1        | 1
1        | 2
3        | 2
2        | 1
2        | 3

I can do a general query for all the groups user [A] is part of:

SELECT groups.group_name FROM groups JOIN group_users 
ON groups.group_id=group_users.group_id WHERE group_users.user_id=2 LIMIT 0,10

Which would of course return this:

The Crazies - <a href="$row['group_id']?join=$my_user_id">Join This Group</a>
My Group - <a href="$row['group_id']?join=$my_user_id">Join This Group</a>

But what I want is a way for [V] to see which groups they share and which they can join

For example:

The Crazies - You're already a member
My Group - <a href="$row['group_id']?join=$my_user_id">Join This Group</a>

At the moment I can only think of doing this using a subquery for each row returned but I'm sure there's an easier, more efficient way to do it using another join or a WHERE IN but so far everything I've tried hasn't worked.

Something like:

SELECT groups.group_name FROM groups JOIN group_users  
ON groups.group_id=group_users.group_id JOIN users AS visitor 
ON visitor.user_id=group_users.user_id WHERE group_users.user_id=2 LIMIT 0,10

but I know this doesn't work.

Any help would be greatly appreciated.

SELECT  g.*, guv.group_id IS NOT NULL AS is_member
FROM    group_users gua
JOIN    groups g
ON      g.group_id = gua.group_id
LEFT JOIN
        group_users guv
ON      guv.group_id = gua.group_id
        AND guv.user_id = $v
WHERE   gua.user_id = $a
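
To see what this kind of LEFT JOIN returns, here is a runnable check in Python using the sample rows from the question, with an in-memory SQLite database standing in for MySQL. It uses the groups table and group_id column as they appear in the question's schema, binds the two user IDs as parameters instead of interpolating $v and $a, and quotes the table name only as a precaution for SQLite. The function name groups_of is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "groups" (
        group_id INTEGER PRIMARY KEY, group_name TEXT, joining_policy TEXT
    );
    CREATE TABLE group_users (
        group_id INTEGER, user_id INTEGER, PRIMARY KEY (group_id, user_id)
    );
    INSERT INTO "groups" VALUES
        (1, 'The Crazies', 'invite_only'),
        (2, 'Team OSM', 'open'),
        (3, 'My Group', 'approval_needed');
    INSERT INTO group_users VALUES (1, 1), (1, 2), (3, 2), (2, 1), (2, 3);
""")

def groups_of(conn, profile_user, visitor):
    """List the profile user's groups, flagging those the visitor is also in."""
    return conn.execute("""
        SELECT g.group_name, guv.group_id IS NOT NULL AS is_member
        FROM group_users gua
        JOIN "groups" g ON g.group_id = gua.group_id
        LEFT JOIN group_users guv
               ON guv.group_id = gua.group_id AND guv.user_id = ?
        WHERE gua.user_id = ?
        ORDER BY g.group_id
    """, (visitor, profile_user)).fetchall()

# Visitor [V] is user 1, profile owner [A] is user 2.
rows = groups_of(conn, profile_user=2, visitor=1)
```

Each row carries a 1 or 0 in is_member, so the page can choose between "You're already a member" and the join link without a per-row subquery.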

Why not DbConnection instead of SqlConnection or OracleConnection?

5 votes

I'm a Java retread pretty new to C#. I'm hoping to stay out of trouble when I crank out a bunch of DML code in the next few weeks.

I'm used to the idea of using JDBC's abstract classes like Connection, Statement, and the like. C# offers similar abstract classes like DbConnection, DbCommand, and so forth, in the System.Data.Common namespace.

But, most of the examples I've seen -- both in MS documentation and other books -- use the concrete classes: SqlConnection, OracleCommand, etc. This kind of concreteness even shows up in the mySQL documentation.

What is the best practice in this area? Is there some strong reason to choose concrete table-server-specific rather than abstract classes for this purpose? (I'm aware of the hazards of downcasting abstract to concrete, of course).

The abstract classes were not part of the first versions of the framework; they were introduced in version 2.0. A lot of examples were written before that, or are based on examples that were written before that.

Using concrete or abstract classes is mostly a matter of taste. It's a nice idea to write code that could work with any database, but my experience is that you don't switch database systems very often, and if you do there are so many changes that you need to do that it doesn't matter much if you used abstract classes or not.

Adding exactly 100 values to database using ajax

5 votes

Hey guys, 3:36am and I'm needing a mental boost.

Simple question: what is the easiest/fastest way to add 100 points to a database? Please assume that not all writes will succeed, due to duplicates, bad data, etc.

I'm trying to update a database with exactly 100 values.

Once I have a good piece of data, I need to add it to the database and I use a function called updateDB.

This function just writes a lat/lng coordinate to the database. If there is a duplicate or the write fails, I send "error" from PHP and the loop should continue collecting data until I have exactly 100 points in the database. Here's the function I'm using.

cct is used for xss prevention, please ignore it, this works fine.

////more above this
if(100-completed > dispatched)
    dispatched++;
    updateDB(lat,lng);
/// more junk and then this function
function updateDB(lat,lng)
{
    var cct = $("input[name=csrf_mysite]").val();
    $.ajax({
        type: "POST",
        url: "/form",
        data: { 
            'lat': lat,
            'lng': lng,
            'id_set': id_set,
            'csrf_complexity': cct },
        success: function(result) {
            var obj = jQuery.parseJSON(result);
            if( obj.status === "OK" )
            {
                completed++;
                var marker = new google.maps.Marker(
                {
                    icon: markerIcon,
                    position: new google.maps.LatLng(lat, lng),
                    map: map
                });
                $( "#progressbar" ).progressbar( "option", {
                    value: completed,
                    max: 100
                });
                $("#amount").text("Getting image " + completed + " of 100");
            }
        },
        error: function(data){
            //alert(data.responseText);
            },
        complete: function(data){
            if(completed == 100)
                window.location = "/start/curate";
            dispatched--;
        }
    });
}

This function does not work. So any idea why?

It should work simply: call updateDB until 100 values have been added, and only call updateDB when there is no possibility of extra calls. dispatched does not decrement properly, so I'm assuming complete isn't called on every event.

Ideas? Or any other way to do this would be awesome.

Write your server-side script to respond with a figure: how many more points it still needs. In the first request you send 100 data points:

Browser             Server
  |                   |
  |---> 100 data ---> |
  |<- send 20 more <- |
  |                   |
  |--->  20 data ---> |
  |<- send  1 more <- |
  |                   |
  |--->  1  data ---> |
  |<- send  0 more <- |
  |                   |
  v                   v

Like that you don't get the kind of sync trouble that comes from counting at the client end how much information has been processed at the server end - the server processes the data, excludes the duplicates, ill-formed etc., and the server counts what it needs.

All the client has to do is send all it can and poll the server for how much more is needed. Also you want to send multiple points at once because of the overheads of an Ajax request, but you don't want to send much more than is actually needed, because that too is wasteful.
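
The exchange in the diagram can be simulated in a few lines. The Python sketch below is only a model of the protocol, not the PHP/jQuery code itself: the Server class, the validity rule, and the random point generator are all assumptions made so the loop can run on its own.

```python
import random

TARGET = 100

class Server:
    """Keeps the authoritative count and tells the client how many more to send."""
    def __init__(self, target=TARGET):
        self.target = target
        self.accepted = set()  # a set silently drops duplicate points

    @staticmethod
    def is_valid(point):
        lat, lng = point
        return -90 <= lat <= 90 and -180 <= lng <= 180

    def submit(self, points):
        """Store valid, new points up to the target; return how many are still needed."""
        for point in points:
            if len(self.accepted) >= self.target:
                break
            if self.is_valid(point):
                self.accepted.add(point)
        return self.target - len(self.accepted)

def random_point(rng):
    # Coarse integer grid, slightly wider than the valid range, so that
    # both duplicates and bad data can occur.
    return (rng.randint(-95, 95), rng.randint(-185, 185))

def fill(server, rng):
    """Client loop: send as many points as the server last asked for."""
    needed = server.target
    rounds = 0
    while needed > 0:
        needed = server.submit([random_point(rng) for _ in range(needed)])
        rounds += 1
    return rounds

server = Server()
rounds = fill(server, random.Random(0))
```

The client never counts successes itself; it only repeats whatever number the server returns, which is what removes the synchronisation problem.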

Hope this helps.

When we don't need a primary key for our table?

5 votes

Will it ever happen that we design a table that doesn't need a primary key?

No.

The primary key does a lot of stuff behind-the-scenes, even if your application never uses it.

For example: clustering improves efficiency (because heap tables are a mess).

Not to mention, if ANYONE ever has to do something on your table that requires pulling a specific row and you don't have a primary key, you are the bad guy.

DB design and optimization considerations for a social application

4 votes

The usual case. I have a simple app that will allow people to upload photos and follow other people. As a result, every user will have something like a "wall" or an "activity feed" where he or she sees the latest photos uploaded from his/her friends (people he or she follows).

Most of the functionalities are easy to implement. However, when it comes to this history activity feed, things can easily turn into a mess because of pure performance reasons.

I have come to the following dilemma here: I can easily design the activity feed as a normalized part of the database, which will save me write cycles but will enormously increase the complexity of selecting those results for each user (for each photo uploaded within a certain time period, select a certain number whose uploaders I am following / for each person I follow, select his photos).

An optimization option could be the introduction of a series of threshold constraints which, for instance would allow me to order the people I follow on the basis of the date of their last upload, even exclude some, to save cycles, and for each user, select only the 5 (for example) last uploaded photos.

The second approach is to introduce a completely denormalized schema for the activity feed, in which every row represents a notification for one of my followers. This means that every time I upload a photo, the DB will put n rows in this "drop bucket", n being the number of people who follow me, i.e. lots of write cycles. If I have such a table, though, I could easily apply some optimization techniques such as clever indexing, as well as pruning entries older than a certain period of time (queue).
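
This denormalized "fan-out on write" approach can be sketched with in-memory structures. The Python below is a toy model, not a schema: the names followers, feeds, follow and upload are made up, and the cap of 5 entries per feed borrows the example threshold mentioned earlier in the question and stands in for the pruning step.

```python
from collections import defaultdict, deque

FEED_LIMIT = 5  # assumed cap per user, standing in for pruning old entries

followers = defaultdict(set)  # uploader -> users who follow them
feeds = defaultdict(lambda: deque(maxlen=FEED_LIMIT))  # user -> newest-first photo ids

def follow(follower, uploader):
    followers[uploader].add(follower)

def upload(uploader, photo_id):
    """Fan out on write: one feed entry per follower of the uploader."""
    for user in followers[uploader]:
        # appendleft keeps the newest first; maxlen drops the oldest entry.
        feeds[user].appendleft(photo_id)

follow("bob", "alice")
follow("carol", "alice")
for n in range(1, 8):
    upload("alice", f"p{n}")

bob_feed = list(feeds["bob"])
```

Reading a feed is then a single lookup per user, at the cost of one write per follower on every upload.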

Yet a third approach that comes to mind is an even less normalized schema, where the server-side application takes some of the complexity off the DB. I saw that some social apps, such as FriendFeed, rely heavily on storing serialized objects such as JSON in the DB.

I am definitely still mastering the skill of scalable DB design, so I am sure that there are many things I've missed, or still to learn. I would highly appreciate it if someone could give me at least a light in the right direction.

I would probably start with a normalized schema so that you can write quickly and compactly. Then use non-transactional (no locking) reads to pull the information back out, making sure to use a cursor so that you can process the results as they come back instead of waiting for the entire result set. Since it doesn't sound like the information has any particularly critical implications, you don't really need to worry about a lot of the concerns that would normally push you away from non-transactional reads.

Ruby style relational tables in PHP

4 votes

I know ruby (on rails) uses lots of "magic", but I use PHP, is there not a way of implementing the rails-like:

class Player < ActiveRecord::Base
    has_many :cards
end

in PHP? It would make so many people's lives so much easier. Are there perhaps frameworks/ORMs that provide similar functionality?

The term you're looking for is an "Object Relational Mapper" (ORM). The one you cite is a component of the Rails framework, called ActiveRecord.

PHP ActiveRecord is one project that attempts to provide this, though the last release was in July 2010. I haven't actually used it; I just switched to Rails. :)