Best database questions in December 2010

Getting-started: Setup Database for Node.js

25 votes

I am new to node.js but am excited to try it out. I am using Express as a web framework, and Jade as a template engine. Both were easy to get setup following this tutorial from Node Camp.

However the one problem I am finding is I can't find a simple tutorial for getting a DB set up. I am trying to build a basic chat application (store session and message).

Does anyone know of a good tutorial?

This other SO post talks about dbs to use- but as this is very different from the Django/MySQL world I've been in, I want to make sure I understand what is going on.

Thanks!

I assume you have npm installed the correct way using one of these snippets(I used the top one).

Redis

I would use redis as a database. For one it is really fast, persistent. You need to install it, but that is really easy.

make

Redis-cli

Next you should play with redis yourself. I would advice you to look at this excellent tutorial by Simon Willison. He and I also advice you to just play with the redis-cli to get a feeling of the database.

Redis client

Finally you need to install a redis client. I would advise you to use mranney's node_redis because I think it is the fastest and most actively developed client.

Installation

npm install hiredis redis

Simple example, included as example.js:

var redis = require("redis"),
    client = redis.createClient();

client.on("error", function (err) {
    console.log("Error " + err);
});

client.set("string key", "string val", redis.print);
client.hset("hash key", "hashtest 1", "some value", redis.print);
client.hset(["hash key", "hashtest 2", "some other value"], redis.print);
client.hkeys("hash key", function (err, replies) {
    console.log(replies.length + " replies:");
    replies.forEach(function (reply, i) {
        console.log("    " + i + ": " + reply);
    });
    client.quit();
});

Storing sessions in database

Also the author of express has created a library to handle your sessions using redis.

Installation:

npm install connect-redis

Example:

var connect = require('connect')
      , RedisStore = require('connect-redis');

connect.createServer(
  connect.cookieDecoder(),
  // 5 minutes
  connect.session({ store: new RedisStore({ maxAge: 300000 }) })
);

Storing messages in database

I think I would use a sorted set for this. Store the messages using ZADD and retrieve them using ZRANK, ZRANGEBYSCORE.

Socket.io

Finally if you are trying to create a simple chat I would advise you to have a look at socket.io.

socket.io aims to make realtime apps possible in every browser and mobile device, blurring the differences between the different transport mechanisms.

I also created a chat using socket.io which I posted on stackoverflow. Adding persistence + authentication should be a breeze.

Useful stock SQL datasets?

15 votes

Does anyone know of any resources that provide good, useful stock datasets? For example, I've downloaded a SQL script that includes all of the U.S. states, cities, and zipcodes. This saved me a lot of time in a recent application where I wanted to be able to do lookups by geography. Are any of you aware of other useful datasets that are freely available for download?

For example:

  • Blacklisted IP addresses
  • Names of colleges/universities
  • Names of corporations/stock symbols

Anyone have any recommendations?

EDIT:

As an example, here is the location where I found a MySQL script containing all of the U.S. zip codes and their corresponding latitude/longitude. Has anyone else found similarly useful datasets in SQL that can be easily imported and used?

http://www.chrissibert.com/blog/wp-content/uploads/2009/06/zipcodes.7z

EDIT 2:

To clarify what type of datasets I'm talking about... I'm referring to datasets that can be immediately useful for applications, can be applied across a variety of scenarios, and typically represent information that is easy to find for small cases but harder to compile for larger data sets. The zip code database is a great example to me. It's not hard to get the lat/long for a single given zip code. But, it's a bit more time consuming to get the values for all valid zip codes in the U.S. This data is also not useful to a single industry or business sector, but can be applied across a range of applications.

Lots of links to open data sets here:

http://www.readwriteweb.com/archives/where_to_find_open_data_on_the.php

although I doubt any of them will generate SQL statements for you.

Scala collection-like SQL support as in LINQ

10 votes

As far as I understand the only thing LINQ supports, which Scala currently doesn't with its collection library, is the integration with a SQL Database.

As far as I understand LINQ can "accumulate" various operations and can give "the whole" statement to the database when queried to process it there, preventing that a simple SELECT first copies the whole table into data structures of the VM.

If I'm wrong, I would be happy to be corrected.

If not, what is necessary to support the same in Scala?

Wouldn't it possible to write a library which implements the collection interface, but doesn't have any data structures backing it but a String which gets assembled with following collection into the required Database statement?

Or am I completely wrong with my observations?

As the author of ScalaQuery, I don't have much to add to Stilgar's explanation. The part of LINQ which is missing in Scala is indeed the expression trees. That is the reason why ScalaQuery performs all its computations on Column and Table types instead of the basic types of those entities.

You declare a table as a Table object with a projection (tuple) of its columns, e.g.:

class User extends Table[(Int, String)] {
  def id = column[Int]("id", O.PrimaryKey, O.AutoInc)
  def name = column[String]("name")
  def * = id ~ name
}

User.id and User.name are now of type Column[Int] and Column[String] respectively. All computations are performed in the Query monad (which is a more natural representation of database queries than the SQL statements that have to be created from it). Take the following query:

val q = for(u <- User if u.id < 5) yield u.name

After some implicit conversions and desugaring this translates to:

val q:Query[String] =
  Query[User.type](User).filter(u => u.id < ConstColumn[Int](5)).map(u => u.name)

The filter and map methods do not have to inspect their arguments as expression trees in order to build the query, they just run them. As you can see from the types, what looks superficially like "u.id:Int < 5:Int" is actually "u.id:Column[Int] < u.id:Column[Int]". Running this expression results in a query AST like Operator.Relational("<", NamedColumn("user", "id"), ConstColumn(5)). Similarly, the "filter" and "map" methods of the Query monad do not actually perform filtering and mapping but instead build up an AST that describes these operations.

The QueryBuilder then uses this AST to construct the actual SQL statement for the database (with a DBMS-specific syntax).

An alternative approach has been taken by ScalaQL which uses a compiler plugin to work directly with expression trees, ensure that they only contain the language subset which is allowed in database queries, and construct the queries statically.

How can I avoid NULLs in my database, while also representing missing data?

8 votes

In SQL and Relational Theory (C.J. Date, 2009) chapter 4 advocates avoiding duplicate rows, and also to avoid NULL attributes in the data we store. While I have no troubles avoiding duplicate rows, I am struggling to see how I can model data without making use of NULL. Take the following, for example - which is a bit from work.

We have an artist table, which has, amongst other columns, a gender column. This is a foreign key to the gender table. However, for some artists, we don't know their gender - for example we've been given a list of new music which has no descriptions of the artist. How, without using NULL is one meant to represent this data? The only solution I see is to add a new gender, "unknown", to the gender table.

While I am thoroughly enjoying this book, I was really disappointed when the chapter concluded with:

Of course, if nulls are prohibited, then missing information will have to be handled by some other means. Unfortunately, those other means are much too complex to be discussed in detail here.

Which is a real shame - because this was the solution I was waiting to read about! There is a reference to read the appendix which has lots of publications to read, but I was hoping for a little bit more of a down to earth summary before I dived into reading these.


I'm getting a few people commenting that they don't understand why I wish to avoid 'NULL' so I will quote the book again. Take the following query:

SELECT s.sno, p.pno
  FROM s, p
 WHERE s.city <> p.city
    OR p.city <> 'Paris'

Now, take the example that s.city is London, and p.city is Paris. In this case, London <> Paris, so the query is true. Now take the case that p.city is not Paris, and is infact xyz. In this case, (London <> xyz) OR (xyz <> Paris) is also True. So, given any data - this query is true. However, if xyz is 'NULL' the scenario changes. In this case both of these expressions are neither True nor False, they are in fact, Unknown. And in this case because the result is unknown you will not get any rows returned.

The move from 2 value logic to 3 value logic can easily introduce bugs like this. Infact, I just introduced one at work which motivated this very post. I wanted all rows where the type != 0 However, this actually ends up matching type == 0 OR type IS NULL - confusing behavior.

Whether or not I model my data with or without NULL in the future is unclear, but I'm very curious what the other solutions are. (I too have always been of the argument that if you don't know, you should use NULL).

Everybody's talking and no one except dportas and Walter can even understand the question. Ok, so 95% of the people on SO do not understand The Null Problem, and feel threatened because their databases are full of Nulls, they want to convert the seeker. Priceless. How is anyone going to learn when they are arguing ?

Good on you, for eliminating Nulls. I have never allowed Nulls in any of my databases.

Of course, if nulls are prohibited, then missing information will have to be handled by some other means. Unfortunately, those other means are much too complex to be discussed in detail here.

Actually it is not so hard at all. There are three alternatives.

  1. Here's a paper on How To Handle Missing Information Without Using NULL by H Darwen, that may help to get your head around the problem.

1.1. Sixth Normal Form is the answer. But you do not have to normalise your entire database to 6NF. For each column that is optional, you need a child table off the main table, with just the PK, which is also the FK, because it is a 1::0-1 relation. Other than the PK, the only column is the optional column.

Look at this Data Model; AssetSerial on page 4 is a classic case: not allAssets have SerialNumbers; but when they do, I want them to store them; more important I want to ensure that they are Unique.

(For the OO people out there, incidentally, that is a three level class diagram in Relational notation, a "Concwete Table Inheritance", no big deal, we've had it fro 30 years.)

1.2. For each such table, use a View to provide the 5NF form of the table. Sure, use Null (or any value that is appropriate for the column) to identify the absence of the column for any row. But do not update via the view.

1.3 Do not use straight joins to grab the 6NF column. Do not use outer joins, either (and have the server fill in a Null for the missing rows). Use a subquery to populate the column, and specify the value that you want returned for a missing value (except if you have Oracle, because its Subquery processing is even worse than its set processing). Eg. and just an eg. you can convert a numeric column to string, and use "Missing" for the missing rows.

When you do not want to go that far (6NF), you have two more options.
.
2. You can use Null substitutes. I use CHAR(0) for character colomns and 0 for numeric. But I do not allow that for FKs. Obviously you need a value that is outside the normal range of data. This does not allow Three Valued Logic.
.
3. In addition to (2), for each Nullable column, you need a boolean Indicator. For the example of the Sex column, the Indicator would be something like SexIsMissing or SexLess (sorry). This allows very tight Three Valued Logic. Many people in that 5% like it because the db remains at 5NF (and less tables); the columns with missing info are loaded with values that are never used; they are only used if the Indicator is false. If you have an enterprise db, you can wrap that in a Function, and always use the UDF, not the raw column.

Of course, in all cases, you can never get away from writing code that is required to handle the missing info. Whether it is ISNULL(), or a subquery for the 6NF column, or an Indicator to check before using the value, or an UDF.

If Null has a specific meaning ... then it is not a Null ! By definition, Null is the Unknown Value.

How do we implement an IS-A Relationship ?

8 votes

We implement an One-to-Many relationship by adding one Table's PK, as FK to the other Table. We implement a Many-to-Many relationship by adding 2 Table's PKs to a third Table.

How do we implement an IS-A Relationship ?

The Entities are TECHNICIAN and ADMINISTRATIVE which both are EMPLOYEE. I could just use an extra field in the Table EMPLOYEE(id, name, surname, role, ...AdminFields..., ...TechFields...)

but i would like to explore the IS-A option.

EDIT: I did as Donnie suggested, but without the role field.

I did as Donnie suggested, but without the role field, because it complicates things. This is the final implementation:

DDL:

CREATE TABLE Employee (
ast VARCHAR(20) not null,
firstname VARCHAR(200) not null,
surname VARCHAR(200) not null,
...
PRIMARY KEY(ast)
);

CREATE TABLE Administrative (
employee_ast VARCHAR(20) not null REFERENCES Employee(ast),
PRIMARY KEY(employee_ast)
);

CREATE TABLE Technical (
employee_ast VARCHAR(20) not null REFERENCES Employee(ast),
...
PRIMARY KEY(employee_ast)
);

ER Diagram:

ERD

In this model there are no Employees of Generic Type. Here, an Employee can only be Administrative or Technical.

What are best practices for collecting, maintaining and ensuring accuracy of a huge data set?

8 votes

I am posing this question looking for practical advice on how to design a system.

Sites like amazon.com and pandora have and maintain huge data sets to run their core business. For example, amazon (and every other major e-commerce site) has millions of products for sale, images of those products, pricing, specifications, etc. etc. etc.

Ignoring the data coming in from 3rd party sellers and the user generated content all that "stuff" had to come from somewhere and is maintained by someone. It's also incredibly detailed and accurate. How? How do they do it? Is there just an army of data-entry clerks or have they devised systems to handle the grunt work?

My company is in a similar situation. We maintain a huge (10-of-millions of records) catalog of automotive parts and the cars they fit. We've been at it for a while now and have come up with a number of programs and processes to keep our catalog growing and accurate; however, it seems like to grow the catalog to x items we need to grow the team to y.

I need to figure some ways to increase the efficiency of the data team and hopefully I can learn from the work of others. Any suggestions are appreciated, more though would be links to content I could spend some serious time reading.

THANKS!

Kyle

Use visitors.

  1. Even if you have one person per item, there will be wrong records, and customers will find it. So, let them mark items as "inappropiate" and make a short comment. But don't forget, they're not your employees, don't ask them too much; see Facebook's "like" button, it's easy to use, and requires not too much energy from the user. Good performance/price. If there would be a mandatory field in Facebook, which asks "why do you like it?", no one should use that function.

  2. Visitors also helps you implicite way: they visit item pages, and use search function (I mean both internal search engine and external ones, like Google). You can gain information from visitors' activity, say, set up the order of the most visited items, then you should concentrate more human forces on the top of the list, and less for the "long tail".

Should not OLAP database be denormalized for reading performance?

7 votes

I always thought that databases should be denormalized for reading, as it is done for OLAP database design, and not exaggerated much further 3NF for OLTP design.

PerformanceDBA in various posts, for ex., in Performance of different aproaches to time-based data defends the paradigm that database should be always well-designed by normalization to 5NF and 6NF (Normal Form).
Have I understood it correctly (and what had I understood correctly?)?
What's wrong with traditional denormalization approach/paradigm design of OLAP databases (below 3NF) and advices that 3NF is enough for most practical cases of OLTP databases?

Please either downvote or upvote, do not mix both.

Update:
I have multiple posts showing 0 votes having 3-7 upvotes + 3-7 downvotes. I prefer having 6-14 downvotes than 0 total score. For ex., this question in a few minutes after posting had 0 with 4 votes (-2+2). I prefer having -4 instead.
I do not care about personal rep but about underepresentation of interest to topics.

Update3: for example:

Update4:
OK, I should confess that I could never grasp the theories that denormalization facilitates data reading performance. Can anybody give me references with good logical explanations of this and of the contrary beliefs?
What are sources to which I can refer to support (to convince my stakeholders) that OLAP/DataWareHousing databases should be normalized?

Update5:
To improve visibility I copied here from comments:

"It would be nice if participants would add (disclose) how many real-life (no science projects included) data-warehouse implementations in 6NF they have seen or participated in. Kind of a quick-pool. Me = 0. – Damir Sudarevic"

http://en.wikipedia.org/wiki/Data_Warehouse tells:

"The normalized approach [vs. dimensional one by Ralph Kimball], also called the 3NF model, whose supporters are referred to as “Inmonites”, believe in Bill Inmon's approach in which it is stated that the data warehouse should be modeled using an E-R model/normalized model"

It looks like normalized datawarehousing approach (by Bill Inmon) is perceived as not exceeding 3NF (?)
I just want to understand what is the origin of the myth (or ubiquitous axiomatic belief) that datawarehousing/OLAP is synonym of denormalization?

Update6:
Damir Sudarevic answered that they are well-paved approach. Let me return to question why is denormalization believed to facilitate reading?

Mythology

I always thought that databases should be denormalized for reading, as it is done for OLAP database design, and not exaggerated much further 3NF for OLTP design.

There's a myth to that effect. In the Relational Database context, I have re-implemented six very large so-called "de-normalised" "databases"; and executed over eighty assignments correcting problems on others, simply by Normalising them, applying Standards and engineering principles. I have never seen any evidence for the myth. Only people repeating the mantra as if it were some sort of magical prayer.

Normalisation vs Un-normalised

("De-normalisation" is a fraudulent term I refuse to use it.)

This is a scientific industry (at least the bit that delivers software that does not break; that put people on the Moon; that runs banking systems; etc). It is governed by the laws of physics, not magic. Computers and software are all finite, tangible, physical objects that are subject to the laws of physics. According to the secondary and tertiary education I received:

  • it is not possible for a bigger, fatter, less organised object to perform better than a smaller, thinner, more organised object.

  • Normalisation yields more tables, yes, but each table is much smaller. And even though there are more tables, there are in fact (a) less joins and (b) the joins are faster because the sets are smaller. Fewer Indices are required overall, because each smaller table needs less indices. Normalised tables also yield much shorter row sizes.

  • for any given set of resources, Normalised tables:

    • fit more rows into the same page size
    • therefore fit more rows into the same cache space, therefore overall throughput is increased)
    • therefore fit more rows into the same disk space, therefore the no of I/Os is reduced; and when I/O is called for, each I/O is more efficient.
      .
  • it is not possible for an object that is heavily duplicated to perform better than an object that is stored as a single version of the truth. Eg. when I removed the 5 x duplication at the table and column level, all the transactions were reduced in size; the locking reduced; the Update Anomalies disappeared. That substantially reduced contention and therefore increased concurrent use.

The overall result was therefore much, much higher performance.

In my experience, which is delivering both OLTP and OLAP from the same database, there has never been a need to "de-normalise" my Normalised structures, to obtain higher speed for read-only (OLAP) queries. That is a myth as well.

  • No, the "de-normalisation" requested by others reduced speed, and it was eliminated. No surprise to me, but again, the requesters were surprised.

Many books have been written by people, selling the myth. It needs to be recognised that these are non-technical people; since they are selling magic, the magic they sell has no scientific basis, and they conveniently avoid the laws of physics in their sales pitch.

(For anyone who wishes to dispute the above physical science, merely repeating the mantra will no have any effect, please supply specific evidence supporting the mantra.)

Why is the Myth Prevalent ?

Well, first, it is not prevalent among scientific types, who do not seek ways of overcoming the laws of physics.

From my experience, I have identified three major reasons for the prevalence:

  1. For those people who cannot Normalise their data, it is a convenient justification for not doing so. They can refer to the magic book and without any evidence for the magic, they can reverently say "see a famous writer validates what I have done". Not Done, most accurately.

  2. Many SQL coders can write only simple, single-level SQL. Normalised structures require a bit of SQL capability. If they do not have that; if they cannot produce SELECTs without using temporary tables; if they cannot write Sub-queries, they will be psychologically glued to the hip to flat files (which is what "de-normalised" structures are), which they can process.

  3. People love to read books, and to discuss theories. Without experience. Especially re magic. It is a tonic, a substitute for actual experience. Anyone who has actually Normalised a database correctly has never stated that "de-normalised is faster than normalised". To anyone stating the mantra, I simply say "show me the evidence", and they have never produced any. So the reality is, people repeat the mythology for these reasons, without any experience of Normalisation. We are herd animals, and the unknown is one of our biggest fears.

    That is why I always include "advanced" SQL and mentoring on any project.

My Answer

This Answer is going to be ridiculously long if I answer every part of your question or if I respond to the incorrect elements in some of the other answers. Eg. the above has answered just one item. Therefore I will answer your question in total without addressing the specific components, and take a different approach. I will deal only in the science related to your question, that I am qualified in, and very experienced with.

Let me present the science to you in manageable segments. Typical First Generation "databases"
The typical model of the six large scale full implementation assignments.

  • These were the closed "databases" commonly found in small firms, and the organisations were large banks
  • very nice for a first generation, get-the-app-running mindset, but a complete failure in terms of performance, integrity and quality
  • they were designed for each app, separately
  • reporting was not possible, they could only report via each app
  • since "de-normalised" is a myth, the accurate technical definition is, they were un-normalised
    • In order to "de-normalise" one must Normalise first; then reverse the process a little in every instance where people showed me their "de-normalised" data models, the simple fact was, they had not Normalised at all; so "de-normalisation" was not possible; it was simply un-normalised
  • since they did not have much Relational technology, or the structures and control of Databases, but they were passed off as "databases", I have placed those words in quotation marks
  • as is scientifically guaranteed for un-normalised structures, they suffered multiple versions of the truth (data duplication) and therefore high contention and low concurrency, within each of them
  • they had an additional problem of data duplication across the "databases"
  • the organisation was trying to keep all those duplicates synchronised, so they implemented replication; which of course meant an additional server; ETL and synching scripts to be developed; and maintained; etc
  • needless to say, the synching was never quite enough and they were forever changing it
  • with all that contention and low throughput, it was no problem at all justifying a separate server for each "database". It did not help much.

So we contemplated the laws of physics, and we applied a little science. 5NF Corporate Database
We implemented the Standard concept that the data belongs to the corporation (not the departments) and the corporation wanted one version of the truth. The Database was pure Relational, Normalised to 5NF. Pure Open Architecture, so that any app or report tool could access it. All transactions in stored procs (as opposed to uncontrolled strings of SQL all over the network). The same developers for each app coded the new apps, after our "advanced" education.

Evidently the science worked. Well, it wasn't my private science or magic, it was ordinary engineering and the laws of physics. All of it ran on one database server platform; two pairs (production & DR) of servers were decommissioned and given to another department. The 5 "databases" totalling 720GB were Normalised into one Database totalling 450GB. About 700 tables (many duplicates and duplicated columns) were normalised into 500 unduplicated tables. It performed much faster, as in 10 times faster overall, and more than 100 times faster in some functions. That did not surprise me, because that was my intention, and the science predicted it, but it surprised the people with the mantra.

More Normalisation

Well, having had success with Normalisation in every project, and confidence with the science involved, it has been a natural progression to Normalise more, not less. In the old days 3NF was good enough, and later NFs were not yet identified. In the last 20 years, I have only delivered databases that had zero update anomalies, so it turns out by todays definitions of NFs, I have always delivered 5NF.

Likewise, 5NF is great but it has its limitations. Eg. Pivoting large tables (not small result sets as per the MS PIVOT Extension) was slow. So I (and others) developed a way of providing Normalised tables such that Pivoting was (a) easy and (b) very fast. It turns out, now that 6NF has been defined, that those tables are 6NF.

Since I provide OLAP and OLTP from the same database, I have found that, consistent with the science, the more Normalised the structures are:

  • the faster they perform

  • and they can be used in more ways (eg Pivots)

So yes, I have consistent and unvarying experience, that not only is Normalised much, much faster than un-normalised or "de-normalised"; more Normalised is even faster than less normalised.

One sign of success is growth in functionality (the sign of failure is growth in size without growth in functionality). Which meant they immediately asked us for more reporting functionality, which meant we Normalised even more, and provided more of those specialised tables (which turned out years later, to be 6NF).

Progressing on that theme. I was always a Database specialist, not a data warehouse specialist, so my first few projects with warehouses were not full-blown implementations, but rather, they were substantial performance tuning assignments. They were in my ambit, on products that I specialised in. Typical Data Warehouse
Let's not worry about the exact level of normalisation, etc, because we are looking at the typical case. We can take it as given that the OLTP database was reasonably normalised, but not capable of OLAP, and the organisation had purchased a completely separate OLAP platform, hardware; invested in developing and maintaining masses of ETL code; etc. And following implementation then spent half their life managing the duplicates they had created. Here the book writers and vendors need to be blamed, for the massive waste of hardware and separate platform software licences they cause organisations to purchase.

  • If you have not observed it yet, I would ask you to notice the similarities between the Typical First Generation "database" and the Typical Data Warehouse

Meanwhile, back at the farm (the 5NF Databases above) we just kept adding more and more OLAP functionality. Sure the app functionality grew, but that was little, the business had not changed. They would ask for more 6NF and it was easy to provide (5NF to 6NF is a small step; 0NF to anything, let alone 5NF, is a big step; an organised architecture is easy to extend).

One major difference between OLTP and OLAP, the basic justification of separate OLAP platform software, is that the OLTP is row-oriented, it needs transactionally secure rows, and fast; and the OLAP doesn't care about the transactional issues, it needs columns, and fast. That is the reason all the high end BI or OLAP platforms are column-oriented, and that is why the OLAP models (Star Schema, Dimension-Fact) are column-oriented.

But with the 6NF tables:

  • there are no rows, only columns; we serve up rows and columns at same blinding speed

  • the tables (ie. the 5NF view of the 6NF structures) are already organised into Dimension-Facts. In fact they are organised into more Dimensions than any OLAP model would ever identify, because they are all Dimensions.

  • Pivoting entire tables with aggregation on the fly (as opposed to the PIVOT of a small number of derived columns) is (a) effortless, simple code and (b) very fast Typical Data Warehouse

What we have been supplying for many years, by definition, is Relational Databases with at least 5NF for OLTP use, and 6NF for OLAP requirements.

  • Notice that it is the very same science that we have used from the outset; to move from Typical un-normalised "databases" to 5NF Corporate Database. We are simply applying more of the proven science, and obtaining higher orders of functionality and performance.

  • Notice the similarity between 5NF Corporate Database and 6NF Corporate Database

  • The entire cost of separate OLAP hardware, platform software, ETL, administration, maintenance, are all eliminated.

  • There is only one version of the data, no update anomalies or maintenance thereof; the same data served up for OLTP as rows, and for OLAP as columns

The only thing we have not done, is to start off on a new project, and declare pure 6NF from the start. That is what I have lined up next.

What is Sixth Normal Form ?

Assuming you have a handle on Normalisation (I am not going to not define it here), the non-academic definitions relevant to this thread are as follows. Note that it applies at the table level, hence you can have a mix of 5NF and 6NF tables in the same database:

  • Fifth Normal Form: all Functional Dependencies resolved across the database
    • in addition to 4NF/BCNF
    • every non-PK column is 1::1 with its PK
    • and to no other PK
    • No Update Anomalies
      .
  • Sixth Normal Form: is the irreducible NF, the point at which the data cannot be further reduced or Normalised (there will not be a 7NF)
    • in addition to 5NF
    • the row consists of a Primary Key, and at most, one non-key column
    • eliminates The Null Problem

What Does 6NF Look Like ?

The Data Models belong to the customers, and our Intellectual Property is not available for free publication. But I do attend this web-site, and provide specific answers to questions. You do need a real world example, so I will publish the Data Model for one of our internal utilities.

This one is for the collection of server monitoring data (enterprise class database server and OS) for any no of customers, for any period. We use this to analyse performance issues remotely, and to verify any performance tuning that we do. The structure has not changed in over ten years (added to, with no change to the existing structures), it is typical of the specialised 5NF that many years later was identified as 6NF. Allows full pivoting; any chart or graph to be drawn, on any Dimension (22 Pivots are provided but that is not a limit); slice and dice; mix and match. Notice they are all Dimensions.

The monitoring data or Metrics or vectors can change (server version changes; we want to pick up something more) without affecting the model (you may recall in another post I stated EAV is the bastard son of 6NF; well this is full 6NF, the undiluted father, and therefore provides all features of EAV, without sacrificing any Standards, integrity or Relational power); you merely add rows.

▶Monitor Statistics Data Model◀. (too large for inline; some browsers cannot load inline; click the link)

It allows me to produce these ▶Charts Like This◀, six keystrokes after receiving a raw monitoring stats file from the customer. Notice the mix-and-match; OS and server on the same chart; a variety of Pivots. (Used with permission.)

Readers who are unfamiliar with the Standard for Modelling Relational Databases may find the ▶IDEF1X Notation◀ helpful.

6NF Data Warehouse

This has been recently validated by Anchor Modelling, in that they are now presenting 6NF as the "next generation" OLAP model for data warehouses. (They do not provide the OLTP and OLAP from the single version of the data, that is ours alone).

Data Warehouse (Only) Experience

My experience with Data Warehouses only (not the above 6NF OLTP-OLAP Databases), has been several major assignments, as opposed to full implementation projects. The results were, no surprise:

  • consistent with the science, Normalised structures perform much faster; are easier to maintain; and require less data synching. Inmon, not Kimball.

  • consistent with the magic, after I Normalise a bunch of tables, and deliver substantially improved performance via application of the laws of physics, the only people surprised are the magicians with their mantras.

Scientifically minded people do not do that; they do not believe in, or rely upon, silver bullets and magic; they use and hard work science to resolve their problems.

Valid Data Warehouse Justification

That is why I have stated in other posts, the only valid justification for a separate Data Warehouse platform, hardware, ETL, maintenance, etc, is where there are many Databases or "databases", all being merged into a central warehouse, for reporting and OLAP.

Kimball

A word on Kimball is necessary, as he is the main proponent of "de-normalised for performance" in data warehouses. As per my definitions above, he is one of those people who have evidently never Normalised in their lives; his starting point was un-normalised (camouflaged as "de-normalised") and he simply implemented that in a Dimension-Fact model.

  • Of course, to obtain any performance, he had to "de-normalise" even more, and create further duplicates, and justify all that.

    • So therefore it is true, in a schizophrenic sort of way, that "de-normalising" un-normalised structures, by making more specialised copies, "improves read performance". It is not true when the whole is taking into account; it is true only inside that little asylum, not outside.

    • Likewise it is true, in that crazy way, that where all the "tables" are monsters, that "joins are expensive" and something to be avoided. They have never had the experience of joining smaller tables and sets, so they cannot believe the scientific fact that more, smaller tables are faster.

    • they have experience that creating duplicate "tables" is faster, so they cannot believe that eliminating duplicates is even faster than that.

  • his Dimensions are added to the un-normalised data. Well the data is not Normalised, so no Dimensions are exposed. Whereas in a Normalised model, the Dimensions are already exposed, as an integral part of the data, no addition is required.

  • that well-paved path of Kimball's leads to the cliff, where more lemmings fall to their deaths, faster. Lemmings are herd animals, as long as they are walking the path together, and dying together, they die happy. Lemmings do not look for other paths.

All just stories, parts of the one mythology that hang out together and support each other.

Your Mission

Should you choose to accept it. I am asking you to think for yourself, and to stop entertaining any thoughts that contradict science and the laws of physics. No matter how common or mystical or mythological they are. Seek evidence for anything before trusting it. Be scientific, verify new beliefs for yourself. Repeating the mantra "de-normalised for performance" won't make your database faster, it will just make you feel better about it. Like the fat kid sitting in the sidelines telling himself that he can run faster than all the kids in the race.

  • on that basis, even the concept "normalise for OLTP" but do the opposite, "de-normalise for OLAP" is a contradiction. How can the laws of physics work as stated on one computer, but work in reverse on another computer ? The mind boggles. It is simply not possible, the work that same way on every computer.

Questions ?

MySQL: What is a page?

7 votes

I can't for the life of me remember what a page is, in the context of a MySQL database. When I see something like 8KB/page, does that mean 8KB per row or ...?

Database pages are the internal basic structure to organize the data in the database files. Here some informations about the InnoDB model.

From 13.2.11.2. File Space Management:

The data files that you define in the configuration file form the InnoDB tablespace. The files are logically concatenated to form the tablespace. [...] The tablespace consists of database pages with a default size of 16KB. The pages are grouped into extents of size 1MB (64 consecutive pages). The “files” inside a tablespace are called segments in InnoDB.

And from 13.2.14. Restrictions on InnoDB Tables

The default database page size in InnoDB is 16KB. By recompiling the code, you can set it to values ranging from 8KB to 64KB.

Further, to put rows in relation to pages:

The maximum row length, except for variable-length columns (VARBINARY, VARCHAR, BLOB and TEXT), is slightly less than half of a database page. That is, the maximum row length is about 8000 bytes. LONGBLOB and LONGTEXT columns must be less than 4GB, and the total row length, including BLOB and TEXT columns, must be less than 4GB.

Program Structure -- Simple Commandline To Do List App -- What's the Haskell way?

6 votes

Background: I am working on my first complete program in Haskell, a simple to do list application on the command line.

My question is one of structure and has two levels: (1) What is the best way to do it? and (2) What is the Haskell (functional) way to do it? The reason I phrase it this way is because I suspect that there may be a faster or easier way to do it that ignores the principles of functional programming. I'd like to do it in the more elegant and clear way, as it is more of a learning exercise than anything else.

Keep in mind that I'd (obviously) like the application to be somewhat persistent. The two options on the table right now are to store information in a flat text file or alternately in a Sqlite database.

The first structure that came to mind was something like this, where a ToDoList type object is simply a list of ToDo items:

import Data.List
import Data.Time

data ToDo = ToDo {
        todoId       :: Int,
        todoDue      :: ZonedTime,
        todoCreated  :: UTCTime,
        todoItem     :: String,
        todoPriority :: Priority,
        todoStatus   :: Status
        }
        deriving (Show, Read)

type ToDoList = [ToDo]

data Priority = Low | Medium | High
        deriving (Show, Read, Eq, Ord)

data Status = Complete | InProgress | Open
        deriving (Show, Read, Eq, Ord)

But then I began to wonder how best to store objects of this type. Would this imply that I store them in a flat file? Is there some way to relate objects of highly specified types like this to fields/columns in a database?

When I think about using a Sqlite database, it seems that all of the work will be done in the database calls, and Haskell types will have comparatively little to do with it. This seems bad.

In summary, the question is how best can I model the data structure of my simple to do list application in keeping with the notions of functional programming and Haskell ideals that I am seeking to practice with this project?

The Show/Read combination is a tremendously easy way of serializing and deserializing the internal state of your application, and because of purity it basically always works too. Furthermore, you'll get good practice writing functions that slice and dice lists, since you'll be able to treat the list as if it is entirely loaded into memory (and maybe if you want to play with some more efficient data structures, you can look at ways to optimize different queries.)

For example, if I want to find all items that are due before some date, I can write this using a filter on due:

dueBefore (ToDoList ts) d = ToDoList (filter (\t -> due t <= d) ts)

Some style nitpicks on your pseudocode:

  • Since all accessor functions (id, die. created...) get dumped into the module-wide namespace, it's good style to prefix/suffix them with the name of the record, such as todoId, todoDie. In this particular case, id is a real function, so you shouldn't shadow it!

  • ToDoList is a single constructor datatype with one value; you probably actually just want a newtype or type synonym. (Exercise: rewrite the above snippet to work with a type synonym.)

  • You probably want an Ord instance on Priority and Status

Load a page of data at a time

6 votes

I've got a data grid with pages with more than 10k rows so its very slow when it first loads. whats the best way of solving this problem. I've read that JDBC paging is the usual solution for such problem but some people are saying to use SQL ROWNUM is an easier solution so I wanted to ask first.

If you think paging is the best solution could you please give me a couple of pointers on how to go on about it(link to implimentation etc)

Check this link about pagination Queries in Oracle.

http://www.oracle.com/technetwork/issue-archive/2007/07-jan/o17asktom-093877.html

Based on the two input parameters ( (1) Page number and (2) number of results to be displayed per page), you can use the query to get the results you need.

Also check the query that Don Roby pointed to above. It is just as good, but I wanted to point you to more oracle specific information and how it is processed.

Opinions on sensor / reading / alert database design

5 votes

I've asked a few questions lately regarding database design, probably too many ;-) However I beleive I'm slowly getting to the heart of the matter with my design and am slowly boiling it down. I'm still wrestling with a couple of decisions regarding how "alerts" are stored in the database.

In this system, an alert is an entity that must be acknowledged, acted upon, etc.

Initially I related readings to alerts like this (very cut down) : -

[Location]
LocationId

[Sensor]
SensorId
LocationId
UpperLimitValue
LowerLimitValue

[SensorReading]
SensorReadingId
Value
Status
Timestamp

[SensorAlert]
SensorAlertId

[SensorAlertReading]
SensorAlertId
SensorReadingId

The last table is associating readings with the alert, because it is the reading that dictate that the sensor is in alert or not.

The problem with this design is that it allows readings from many sensors to be associated with a single alert - whereas each alert is for a single sensor only and should only have readings for that sensor associated with it (should I be bothered that the DB allows this though?).

I thought to simplify things, why even bother with the SensorAlertReading table? Instead I could do this:

[Location]
LocationId

[Sensor]
SensorId
LocationId

[SensorReading]
SensorReadingId
SensorId
Value
Status
Timestamp

[SensorAlert]
SensorAlertId
SensorId
Timestamp

[SensorAlertEnd]
SensorAlertId
Timestamp

Basically I'm not associating readings with the alert now - instead I just know that an alert was active between a start and end time for a particular sensor, and if I want to look up the readings for that alert I can do.

Obviously the downside is I no longer have any constraint stopping me deleting readings that occurred during the alert, but I'm not sure that the constraint is neccessary.

Now looking in from the outside as a developer / DBA, would that make you want to be sick or does it seem reasonable?

Is there perhaps another way of doing this that I may be missing?

Thanks.

EDIT: Here's another idea - it works in a different way. It stores each sensor state change, going from normal to alert in a table, and then readings are simply associated with a particular state. This seems to solve all the problems - what d'ya think? (the only thing I'm not sure about is calling the table "SensorState", I can't help think there's a better name (maybe SensorReadingGroup?) : -

[Location]
LocationId

[Sensor]
SensorId
LocationId

[SensorState]
SensorStateId
SensorId
Timestamp
Status
IsInAlert

[SensorReading]
SensorReadingId
SensorStateId
Value
Timestamp

There must be an elegant solution to this!

Revised 01 Jan 11 21:50 UTC

Data Model

I think your Data Model should look like this:▶Sensor Data Model◀. (Page 2 relates to your other question re History).

Readers who are unfamiliar with the Relational Modelling Standard may find ▶IDEF1X Notation◀ useful.

Business (Rules Developed in the Commentary)

I did identify some early business Rules, which are now obsolete, so I have deleted them

These can be "read" in the Relations (read adjacent to the Data Model). The Business Rules and all implied Referential and Data Integrity can be implemented in, and thus guaranteed by, RULES, CHECK Constraints, in any ISO SQL database. This is a demonstration of IDEF1X, in the development of both the Relational keys, and the Entities and Relations. Note the Verb Phrases are more than mere flourish.

Apart from three Reference tables, the only static, Identifying entities are Location, NetworkSlave, and User. Sensor is central to the system, so I ahve given it its own heading.

Location

  • A Location contains one-to-many Sensors
  • A Location may have one Logger

NetworkSlave

  • A NetworkSlave collects Readings for one-to-many NetworkSensors

User

  • An User may maintain zero-to-many Locations
  • An User may maintain zero-to-many Sensors
  • An User may maintain zero-to-many NetworkSlaves
  • An User may perform zero-to-many Downloads
  • An User may make zero-to-many Acknowledgements, each on one Alert
  • An User may take zero-to-many Actions, each of one ActionType

Sensor

  • A SensorType is installed as zero-to-many Sensors

  • A Logger (houses and) collects Readings for one LoggerSensor

  • A Sensor is either one NetworkSensor or one LoggerSensor

    • A NetworkSensor records Readings collected by one NetworkSlave
      .
  • A Logger is periodically Downloaded one-to-many times
    • A LoggerSensor records Readings collected by one Logger
      .
  • A Reading may be deemed in Alert, of one AlertType
    • An AlertType may happen on zero-to-many Readings
      .
  • An Alert may be one Acknowledgement, by one User .
  • An Acknowledgement may be closed by one Action, of one ActionType, by one User
    • An ActionType may be taken on zero-to-many Actions

Responses to Comments

  1. Sticking Id columns on everything that moves, interferes with the determination of Identifiers, the natural Relational keys that give your database relational "power". They are Surrogate Keys, which means an additional Key and Index, and it hinders that relational power; which results in more joins than otherwise necessary. Therefore I use them only when the Relational key becomes too cumbersome to migrate to the child tables (and accept the imposed extra join).

  2. Nullable keys are a classic symptom of an Unnormalised database. Nulls in the database is bad news for performance; but Nulls in FKs means each table is doing too many things, has too many meanings, and results is very poor code. Good for people who like to "refactor" their databases; completely unnecessary for a Relational database.

  3. Resolved: An Alert may be Acknowledged; An Acknowledgement may be Actioned.

  4. The columns above the line are the Primary Key (refer Notation document). SensorNo is a sequential number within LocationId; refer Business Rules, it is meaningless outside a Location; the two columns together form the PK. When you are ready to INSERT a Sensor (after you have checked that the attempt is valid, etc), it is derived as follows. This excludes LoggerSensors, which are zero:

    INSERT Sensor VALUES (
        @LocationId,
        SensorNo = ( SELECT ISNULL(MAX(SensorNo), 0) + 1
            FROM Sensor
            WHERE LocationId = @LocationId
            )
        @SensorCode
        )

  5. For accuracy or improved meaning, I have changed NetworkSlave monitors NetworkSensor to NetworkSlave collects Readings from NetworkSensor.

  6. Check Constraints. The NetworkSensor and LoggerSensor are exclusive subtypes of Sensor, and their integrity can be set by CHECK constraints. Alerts, Acknowledgements and Actions are not subtypes, but their integrity is set by the same method, so I will list them together.

    • Every Relation in the Data Model is implemented as a CONSTRAINT in the child (or subtype) as FOREIGN KEY (child_FK_columns) REFERENCES Parent (PK_columns)

    • A Discriminator is required to identify which subtype a Sensor is. This is SensorNo = 0 for LoggerSensors; and non-zero for NetworkSensors.

    • The existence of NetworkSensors and LoggerSensors are constrained by the FK CONSTRAINTS to NetworkSlave and Logger, respectively; as well as to Sensor.
    • In NetworkSensor, include a CHECK constraint to ensure SensorNo is non-zero
    • In LoggerSensor, include a CHECK constraint to ensure SensorNo is zero

    • The existence of Acknowledgements and Actions are constrained by the identified FK CONSTRAINTS (An Acknowledgement cannot exist without an Alert; an Action cannot exist without an Acknowledgement). Conversely, an Alert with no Acknowledgement is in an unacknowledged state; an Alert with and Acknowledgementbut no Action is in an acknowledged but un-actioned state. .

  7. Alerts. The concept in a design for this kind of (live monitoring and alert) application is many small programs, running independently; all using the database as the single version of the truth. Some programs insert rows (Readings, Alerts); other programs poll the db for existence of such rows (and send SMS messages, etc; or hand-held units pick up Alerts relevant to the unit only). In that sense, the db is a may be described as an message box (one program puts rows in, which another program reads and actions).

    The assumption is, Readings for Sensors are being recorded "live" by the NetworkSlave, and every minute or so, a new set of Readings is inserted. A background process executes periodically (every minute or whatever), this is the main "monitor" program, it will have many functions within its loop. One such function will be to monitor Readings and produce Alerts that have occurred since the last iteration (of the program loop).

    The following code segment will be executed within the loop, one for each AlertType. It is a classic Projection:

    -- Assume @LoopDateTime contains the DateTime of the last iteration
    INSERT Alert
        SELECT LocationId,
               SensorNo,
               ReadingDtm,
               "L"          -- AlertType "Low"
            FROM Sensor  s,
                 Reading r
            WHERE s.LocationId = r.LocationId
            AND   s.SensorNo   = r.SensorNo
            AND   r.ReadingDtm > @LoopDtm
            AND   r.Value      < s.LowerLimit
    INSERT Alert
        SELECT LocationId,
               SensorNo,
               ReadingDtm,
               "H"          -- AlertType "High"
            FROM Sensor  s,
                 Reading r
            WHERE s.LocationId = r.LocationId
            AND   s.SensorNo   = r.SensorNo
            AND   r.ReadingDtm > @LoopDtm
            AND   r.Value      > s.UpperLimit
    So an Alert is definitely a fact, that exists as a row in the database. Subsequently that may be Acknowledged by an User (another row/fact), and Actioned with an ActionType by an User.

    Other that this (the creation by Projection act), ie. the general and unvarying case, I would refer to Alert only as a row in Alert; a static object after creation.

  8. Concerns re Changing Users. That is taken care of already, as follows. At the top of my (revised yesterday) Answer, I state that the major Identifying elements are static. I have re-sequenced the Business Rules to improve clarity.

    • For the reasons you mention, User.Name is not a good PK for User, although it remains an Alternate Key (Unique) and the one that is used for human interaction.

    • User.Name cannot be duplicated, there cannot be more than one Fred; there can be in terms of FirstName-LastName; two Fred Bloggs, but not in terms of User.Name. Our second Fred needs to choose another User.Name. Note the identified Indices.

    • UserId is the permanent record, and it is already the PK. Never delete User, it has historical significance. In fact the FK constraints will stop you (never use CASCADE in a real database, that is pure insanity). No need for code or triggers, etc.

    • Alternately (to delete Users who never did anything, and thus release User.Name for use) allow Delete as long as there are no FK violations (ie. UserId is not referenced in Download, Acknowledgement, Action).

    To ensure that only Users who are Current perform Actions, add an IsObsolete boolean in User (DM Updated), and check that column when that table is interrogated for any function (except reports) You can implement a View UserCurrent which returns only those Users.

    Same goes for Location and NetworkSlave. If you need to differentiate current vs historical, let me know, I will add IsObsolete to them as well.

    I don't know: you may purge the database of ancient Historical data periodically, delete rows that are (eg) over 10 years old. That has to be done from the bottom (tables) first, working up the Relations.

Feel free to ask Questions.

Note the IDEF1 Notation document has been expanded.

looking for a stand-alone, in-memory data server with sequential access

5 votes

We need an in-memory data structure / DB server with following characteristics:

  1. stand-alone server: will run on the same machine as the clients, so any kind of IPC is ok
  2. sequential access: get next/previous key
  3. two keys (string) per record: kind of a bi-directional map, actually
  4. in-memory only: should have option for no persistence at all
  5. RAM disk/tmpfs solution is not desirable
  6. SQL/ODBC is an option, although not required
  7. commercial product ok, if the OEM license price is reasonable

So far we have considered the following options, but no satisfactory solution yet:

  • mysql: answers all requirements, but now answer yet as to how much an OEM license will cost
  • memcached, cassandra: no sequential access, according to online docs
  • Redis: seems as a nice tool, but again, I don't see get get nex/prev in the spec
  • Postgres: couldn't make it work well on Linux/ODBC

Seems that all nosql in-memory DBs provide only direct access by key, hash-table style. Have not checked Apache Derby yet.

We run on Linux, the client is in C++.

Will be glad to have your suggestions. Thanks!

EDIT: It seems that we will be pursuing the mysql option, we've got a reasonable price offer, and it will also be useful as our main (persistent) DB. Thanks for all answers, it is as usual difficult to select just one.

I'd suggest Kyoto Cabinet, it has an in memory function and sequential access.

edit: what's wrong with stl map?

Best way to store working hours and query it efficiently

5 votes

I'm planning to store working hours for shops. I'm wondering what might be the best modeling for the working hours field so that I can get a list of open/closed shops at the current moment in very efficient way.

To store normal operation hours, you would need to store a number of records containing:

  • Shop - INTEGER
  • DayOfWeek - INTEGER (0-6)
  • OpenTime - TIME
  • CloseTime - TIME

I assume for example that each shop has reduced hours during national holidays, or has plant shutdowns, so you would also need to store some override records:

  • Shop - INTEGER
  • OverrideStartDate - DATE
  • OverrideEndDate - DATE
  • DayOfWeek - INTEGER (0-6)
  • AltOpenTime - TIME
  • AltCloseTime - TIME
  • Closed - INTEGER (0, 1)

To find open shops is trivial, but you also need to check if there are override hours:

SELECT Shop
FROM OverrideHours
WHERE OverrideStartDate <= NOW()
AND OverrideEndDate >= NOW()
AND DayOfWeek = WEEKDAY(NOW())

If there are any record returned, those shops have alternate hours or are closed.

There may be some nice SQL-fu you can do here, but this gives you the basics.

EDIT

I haven't tested this, but this should get you close:

SELECT Normal.Shop
FROM Normal
LEFT JOIN Override
ON Normal.Shop = Override.Shop
AND Normal.DayOfWeek = Override.DayOfWeek
AND NOW() BETWEEN Override.OverrideStartDate AND Override.OverrideEndDate
WHERE Normal.DayOfWeek = WEEKDAY(NOW())
AND ((Override.Shop IS NULL AND TIME(NOW()) BETWEEN Normal.OpenTime AND Normal.CloseTime)
 OR  (Override.Shop IS NOT NULL AND Override.Closed <> 1 AND TIME(NOW()) BETWEEN Override.AltOpenTime AND Override.AltCloseTime))

EDIT

As for efficiency, it is efficient in the sense that you only have to make one call to MySQL which is often a bottleneck if it is across a network. You'll have to test and see whether this performs to your specifications. If not, you may be to play with some indices.

EDIT

Testing. Not complete testing, but some.

mysql> select * from Normal;
+------+-----------+----------+-----------+
| Shop | DayOfWeek | OpenTime | CloseTime |
+------+-----------+----------+-----------+
|    1 |         1 | 09:00:00 | 17:00:00  | 
|    1 |         5 | 09:00:00 | 16:00:00  | 
|    2 |         1 | 09:00:00 | 17:00:00  | 
|    2 |         5 | 09:00:00 | 17:00:00  | 
+------+-----------+----------+-----------+
4 rows in set (0.01 sec)

mysql> select * from Override;
+------+-------------------+-----------------+-----------+-------------+--------------+--------+
| Shop | OverrideStartDate | OverrideEndDate | DayOfWeek | AltOpenTime | AltCloseTime | Closed |
+------+-------------------+-----------------+-----------+-------------+--------------+--------+
|    2 | 2010-12-01        | 2010-12-31      |         1 | 09:00:00    | 18:00:00     |      0 | 
|    2 | 2010-12-01        | 2010-12-31      |         5 | 09:00:00    | 18:00:00     |      0 | 
|    1 | 2010-12-01        | 2010-12-31      |         1 | 09:00:00    | 17:00:00     |      1 | 
+------+-------------------+-----------------+-----------+-------------+--------------+--------+
3 rows in set (0.00 sec)

mysql> SET @whenever = TIMESTAMP('2010-11-23 16:05');
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT WEEKDAY(@whenever);
+--------------------+
| WEEKDAY(@whenever) |
+--------------------+
|                  1 | 
+--------------------+
1 row in set (0.00 sec)

mysql> SELECT Normal.Shop FROM Normal LEFT JOIN Override ON Normal.Shop = Override.Shop AND Normal.DayOfWeek = Override.DayOfWeek AND @whenever BETWEEN Override.OverrideStartDate AND Override.OverrideEndDate WHERE Normal.DayOfWeek = WEEKDAY(@whenever) AND ((Override.Shop IS NULL AND TIME(@whenever) BETWEEN Normal.OpenTime AND Normal.CloseTime)  OR  (Override.Shop IS NOT NULL AND Override.Closed <> 1 AND TIME(@whenever) BETWEEN Override.AltOpenTime AND Override.AltCloseTime));
+------+
| Shop |
+------+
|    1 | 
|    2 | 
+------+
2 rows in set (0.00 sec)

mysql> SET @whenever = TIMESTAMP('2010-11-23 17:05');
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT Normal.Shop FROM Normal LEFT JOIN Override ON Normal.Shop = Override.Shop AND Normal.DayOfWeek = Override.DayOfWeek AND @whenever BETWEEN Override.OverrideStartDate AND Override.OverrideEndDate WHERE Normal.DayOfWeek = WEEKDAY(@whenever) AND ((Override.Shop IS NULL AND TIME(@whenever) BETWEEN Normal.OpenTime AND Normal.CloseTime)  OR  (Override.Shop IS NOT NULL AND Override.Closed <> 1 AND TIME(@whenever) BETWEEN Override.AltOpenTime AND Override.AltCloseTime));
Empty set (0.01 sec)

mysql> SET @whenever = TIMESTAMP('2010-12-25 16:05');
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT Normal.Shop FROM Normal LEFT JOIN Override ON Normal.Shop = Override.Shop AND Normal.DayOfWeek = Override.DayOfWeek AND @whenever BETWEEN Override.OverrideStartDate AND Override.OverrideEndDate WHERE Normal.DayOfWeek = WEEKDAY(@whenever) AND ((Override.Shop IS NULL AND TIME(@whenever) BETWEEN Normal.OpenTime AND Normal.CloseTime)  OR  (Override.Shop IS NOT NULL AND Override.Closed <> 1 AND TIME(@whenever) BETWEEN Override.AltOpenTime AND Override.AltCloseTime));
+------+
| Shop |
+------+
|    2 | 
+------+
1 row in set (0.00 sec)

mysql> SET @whenever = TIMESTAMP('2010-11-23 17:05');
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT WEEKDAY(@whenever);
+--------------------+
| WEEKDAY(@whenever) |
+--------------------+
|                  1 | 
+--------------------+
1 row in set (0.00 sec)

mysql> SELECT Normal.Shop FROM Normal LEFT JOIN Override ON Normal.Shop = Override.Shop AND Normal.DayOfWeek = Override.DayOfWeek AND @whenever BETWEEN Override.OverrideStartDate AND Override.OverrideEndDate WHERE Normal.DayOfWeek = WEEKDAY(@whenever) AND ((Override.Shop IS NULL AND TIME(@whenever) BETWEEN Normal.OpenTime AND Normal.CloseTime)  OR  (Override.Shop IS NOT NULL AND Override.Closed <> 1 AND TIME(@whenever) BETWEEN Override.AltOpenTime AND Override.AltCloseTime));
Empty set (0.00 sec)

How can I store HTML in a DB (SQLITE PYTHON)

5 votes

This is probably quite simple, but I can't get there...

How can I store html code in a SQLITE Data Base?

I'm using text as the data-type for the field in the DB (should it be blob??)

I'm getting weird errors (and changing erros with the same input, so I think it has something to do with escaping)

MY CODE:

con = sqlite.connect(bd)
cur = con.cursor()
temp=cur.execute ('SELECT * from posts').fetchall()
#temp[Z][1] = ID
#temp[Z][4] = URL
i=0
while i< len (temp):
    if temp[i][0]==None:
        try:
            html = urllib2.urlopen(str(temp[i][4])).read()
        except:
            html=None
        #sql = 'UPDATE posts SET html = "' + str(html) + '" WHERE  id = ' +  str(temp[i][1])
        #cur.execute( 'UPDATE posts SET html = ? WHERE  id = ?' ,(html,temp[i][1]) )
        cur.execute("UPDATE posts SET html = '" + str(html) + "' WHERE  id = " +  str(temp[i][1]))
        con.commit()
        print temp[i][4]
    i=i+1

The errors:

1 -

OperationalError: near "2": syntax error WARNING: Failure executing file: Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56) Type "copyright", "credits" or "license" for more information.

2-

ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

P.s. I would rather If it would be text (human readble) than blob, but if it's the easier way, I'm all for it.

Thanx

Try:

cur.execute(
    "UPDATE posts SET html = ? WHERE id = ?", (html ,temp[i][1]))

Use parameterized arguments to allow sqlite3 to escape the quotes for you. (It also helps prevent SQL injection.)

Regarding the ProgrammingError: html should be a unicode object, rather than a string object. When you open the url:

response=urllib2.urlopen(str(temp[i][4]))

Look at the content type header:

content_type=response.headers.getheader('Content-Type')
print(content_type)

It might say something like

'text/html; charset=utf-8'

in which case you should decode the html string with the utf-8 codec:

html = response.read().decode('utf-8')

This will make html a unicode object, and (hopefully) address the ProgrammingError.

Multiple Threads accessing the database: one with long transaction, one with short transactions

5 votes

Let's say I have a desktop app that acts as a garage for a bunch of cars:

@Entity
public class Garage {
    private List<Car> cars = new ArrayList<Car>();
    ...
}

The desktop app has a "simulation" button that starts a new thread and starts calling methods on the Garage, Car, Wheel, etc etc. This simulation can take as long as 10 minutes to run. At the moment I have a class that looks like this:

beginTransaction();
Garage garage = garageDao.findGarage(1);
List<Car> cars = garage.getCars();
for (Car car : cars) {
    // call methods on the car to lazily fetch other things like wheels...
}
commitTransaction();

This code only does "reads" and never "writes"

So the above can take a long time depending on how badly the cars need a service. While the above is happening, the user may continue working using the desktop app. They may choose to change the color of a car that is being used in the above transaction.

My question is whether the above long transaction is going to prevent the changing of the car color? i.e. the user changing the color of the car in the desktop app will be prevented from committing the change until the long transaction is finished?

Why should it? You're, by default, using optimistic transactions, so there is no locking to be applied to rows being read (unless you're not showing us some JPA2 lock() calls). Commit of the transaction should then check on optimistic version of the records (if you have a version defined) and use that to decide whether to commit the changes.

Why there can be only one TIMESTAMP column with CURRENT_TIMESTAMP in DEFAULT clause?

5 votes

Why there can be only one TIMESTAMP column with CURRENT_TIMESTAMP in DEFAULT or ON UPDATE clause?

CREATE TABLE `foo` (
  `ProductID` INT(10) UNSIGNED NOT NULL,
  `AddedDate` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `UpdatedDate` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
) ENGINE=INNODB;

The error that results:

Error Code : 1293

Incorrect table definition; there can be only one TIMESTAMP column with CURRENT_TIMESTAMP in DEFAULT or ON UPDATE clause

Hello!

I also wondered that long time ago. I searched a bit in my history and I think that this post: http://lists.mysql.com/internals/34919 represents the semi-official position of MySQL (before Oracle's intervention ;))

In short:

this limitation stems only from the way in which this feature is currently implemented in the server and there are no other reasons for its existence.

So their explanation is "because it is implemented like this". Doesn't sound very scientific. I guess it all comes from some old code. This is suggested in the thread above: "carry-over from when only the first timestamp field was auto-set/update".

Cheers!

Historical / auditable database

5 votes

Hi all, This question is related to the schema that can be found in one of my other questions here. Basically in my database I store users, locations, sensors amongst other things. All of these things are editable in the system by users, and deletable.

However - when an item is edited or deleted I need to store the old data; I need to be able to see what the data was before the change.

There are also non-editable items in the database, such as "readings". They are more of a log really. Readings are logged against sensors, because its the reading for a particular sensor.

If I generate a report of readings, I need to be able to see what the attributes for a location or sensor was at the time of the reading.

Basically I should be able to reconstruct the data for any point in time.

Now, I've done this before and got it working well by adding the following columns to each editable table:

valid_from
valid_to
edited_by

If valid_to = 9999-12-31 23:59:59 then that's the current record. If valid_to equals valid_from, then the record is deleted.

However, I was never happy with the triggers I needed to use to enforce foreign key consistency.

I can possibly avoid triggers by using the extension to the "PostgreSQL" database. This provides a column type called "period" which allows you to store a period of time between two dates, and then allows you to do CHECK constraints to prevent overlapping periods. That might be an answer.

I am wondering though if there is another way.

I've seen people mention using special historical tables, but I don't really like the thought of maintainling 2 tables for almost every 1 table (though it still might be a possibility).

Maybe I could cut down my initial implementation to not bother checking the consistency of records that aren't "current" - i.e. only bother to check constraints on records where the valid_to is 9999-12-31 23:59:59. Afterall, the people who use historical tables do not seem to have constraint checks on those tables (for the same reason, you'd need triggers).

Does anyone have any thoughts about this?

PS - the title also mentions auditable database. In the previous system I mentioned, there is always the edited_by field. This allowed all changes to be tracked so we could always see who changed a record. Not sure how much difference that might make.

Thanks.

Revised 01 Jan 11

Ok, so there is a gap between where I sit (deliver fully auditable databases; yours being a particular requirement of that) and where you sit: based on your questions and comments. Which we will probably work out in the commentary. Here's a position to start from.

  • To provide this requirement, there is no need at all for: triggers; mass duplication; broken integrity; etc.

  • This is not a classic Temporal requirement, either, so no need for the "period" capability, but you can.

  • ValidFrom and ValidTo is a Normalisation error: the ValidTo is data that is easily derived; ValidTo in any row is duplicated, in the ValidFrom of the next row; you have an Update Anomaly (when you update one column in one row, you additionally have to update the other column in the next row); you have to use a dummy value for "current".

    • All unnecessary, use ValidFrom only, and keep the db clean and pure 5NF.

    • The Caveat is, if PostgreSQL can't perform Subqueries without falling in a heap (ala Oracle), then fine, kep ValidTo.

All of these things are editable in the system by users, and deletable.

Well, no. It is a database holding important information; with Referential Integrity, not a scratchpad, so the user cannot just walk up to it and "delete" something. It will contradict the same users requirement for maintaining historical data (in the Reading; Alert; Ack; Action; Download).

  • Cascading deletes are not allowed. Those functions are check boxes for non-databases, MS Access types. For real databases, the RI constraints stop parents with children from being deleted.

  • Primary Keys cannot (should not) be changed. Eg. UserId; LocationId; NetworkSlaveCode never change; remember, they are carefully considered Identifiers. One characteristic of PKs is that they are stable.

  • You can add new Users; you can change a current User's name; but you cannot delete an User who has entries in Download, Acknowledgement, Action.

Basically if it's editable then it has to be historical (so that excludes readings and alerts).

Also excludes: Downloads; Acknowledgements; Actions.

And the Reference tables: SensorType; AlertType; ActionType.

And the new History tables: they are inserted into, but they cannot be updated or deleted.

The problem I find with the isObselete flag is.. Say if you change the Location, the Sensor foreign key will now point to an obselete record, meaning you will have to duplicate every sensor record. This problem gets exponentially worse as the hierachy gets bigger.

  • Ok, so now do you understand the LocationId (FK) in Sensor will not change; there is no mass duplication, etc ? There is no problem in the first place (and there is in that stupid book!) that gets exponentially worse in the second place.

  • IsObsolete is inadequate for your requirement. (Refer below)

  • The UpdatedDtm in any real row (Reading, etc) identifies the Parent (FK to Sensor) History row (its AuditedDtm) that was in effect at the time.

  • Full Relational capability; Declarative Refential Integrity, etc.

  • Maintain the IDEF1X, Relational concept of strong Identifiers ... There is only one Current parent row (eg. Location)

  • The rows in the History are Images of the current row, before it was changed, at the stated AuditedDtm. The Current row (non-history) shows the one last UpdatedDtm, when the row was changed.

  • The AuditedDtm shows the entire series of UpdatedDtms for any given key; and thus I have used it to "partition" the real key in a temporal sense.

All that is required is a History table for each changeable table. I have provided the Hiistory tables for four Identifying tables: Location; Sensor; NetworkSlave; and User.

Please read this for understanding Auditable in the accounting sense.

Data Model

Link to Sensor Data Model with History (Page 2 contains the History tables and context).

Readers who are not familiar with the Relational Modelling Standard may find IDEF1X Notation useful.

Response to Comments

(1) My first issue is that of referential integrity with the historic data, in that I'm not sure there is any, and if there is I'm not sure how it works. For instance, in SensoryHistory it would be possible to add a record that had an UpdatedDtm indicating a date time before the location itself existed, if you see what I mean. Whether this is actually an issue I'm not sure - enforcing that might be over the top.

(You raised a similar issue in the other question.) It may be that the dbs you have experienced did not actually have the Referential Integrity in place; that the Relation lines were there just for documentation; that the RI was "implemented in app code" (which means there is no RI).

This is an ISO/IEC/ANSI Standard SQL database. That allows Declarative Referential Integrity. Every Relation line is implemented as a PK::FK Reference, an actual Constraint that is Declared. Eg:

CREATE TABLE Location
    ...
    CONSTRAINT UC_PK
        PRIMARY KEY (LocationId)
    ...
CREATE TABLE Sensor
    ...
    CONSTRAINT UC_PK
        PRIMARY KEY (LocationId, SensorNo)
    CONSTRAINT Location_Sensor_fk
        FOREIGN KEY (LocationId)
        REEFERENCES Location(LocationId)
    ...
CREATE TABLE SensorHistory
    ...
    CONSTRAINT UC_PK
        PRIMARY KEY (LocationId, SensorNo, UpdatedDtm))
    CONSTRAINT Sensor_SensorHistory_fk
        FOREIGN KEY (LocationId, SensorNo)
        REEFERENCES Sensor (LocationId, SensorNo)
    ...
Those Declared Constraints are enforced by the server; not via triggers; not in app code. That means:

  • A Sensor with a LocationId that does not exist in Location cannot be inserted
  • A LocationId in Location that has rows in Sensor cannot be deleted
  • A SensorHistory with a LocationId+SensorNo that does not exist in Sensor cannot be inserted
  • A LocationId+SensorNo in Sensor that has rows in SensorHistory cannot be deleted.

(1.1) All columns should have RULEs and CHECK Constraints to Constrain their range of values. That in addition to the fact that all INSERT/UPDATE/DELETEs are programmatic, within stored procs, therefore accidents do not happen, and people do not walk up to the database and run commands against it (excepts SELECTS).

Generally I stay away from triggers. If you are using stored procs, and the normal permissions, then this:

in SensoryHistory it would be possible to add a record that had an UpdatedDtm indicating a date time before the Location itself existed, if you see what I mean

is prevented. So is inserting a SensorHistory with an UpdatedDtm earlier than the Sensor itself. But procs are not Declarative Rules. However if you want to be doubly sure (and I mean doubly, because the INSERTS are all via a proc, direct command by users), then sure, you have to use a trigger. For me, that is over the top.

(2) how do I indicate deletion? I could just add a flag to the non-historical version of the table I guess.

Not sure yet. Eg. Do you accept that when a Sensor is deleted, it is final ... (yes, history is maintained) ... and then when a new Sensor is added to the Location, it will have a new SensorNo ... there is no Sensor being logically replaced with the new one, with or without a gap in time ?

From a end-user's point of view, via the software they should be able to add, edit and delete sensors at will with no limitation. But yes, once deleted it is deleted and cannot be undeleted. There's nothing to stop them re-adding a sensor later though with the exact same parameters.

And "delete" Locations, NetworkSlaves, and Users as well.

Ok. Then the new Sensor with the same parameters, is truly new, it has a new SensorNo, and is independent of any previous logical Sensor. We can add an IsObsolete BOOLEAN to the four identifying tables; it is now identified as adequate. The Delete is now a Soft Delete.

(2.1) For NetworkSensor and LoggerSensor, which are actually dependent on two parents: they are obsolete if either of their parents are obsolete. So there is no point giving them an IsObsolete column, which has a dual meaning, which can be derived from the applicable parent.

(2.2) Just to be clear, users cannot delete any rows from any Transaction and History tables, right?

(3) When updating a table, what method would be best to insert the new row in the historical table and update the main table? Just normal SQL statements inside a transaction maybe?

Yes. That is the classic use of a Transaction, as per ACID Properties, it is Atomic; it either succeeds in toto or fails in toto (to be retried later when the problem is fixed).

(4) Referenced Book

The definitive and seminal text is Temporal Data and the Relational Model C J Date, H Darwen, N A Lorentzos. As in, those of us who embrace the RM are familiar with the extensions, and what is required in the successor to the RM; rather than some other method.

The referenced book is horrible, and free. The PDF isn't a PDF (no search; no indexing). Opening my MS and Oracle is telling; a few good bits couched in lots of fluff. Many misrepresentations. Not worth responding to in detail (if you want a proper review, open a new question).

(4.1) ValidTo in addition to ValidFrom. Serious mistake (as identified at the top of my answer) which the book makes; then laboriously solves. Don't make the mistake in the first place, and you have nothing to solve in the second place. As I understand it, that will eliminate your triggers.

(4.2) Simple rules, taking both Normalisation and Temporal requirements into account. First and foremost, you need to deeply understand (a) the temporal requirement and (b) the DataTypes, correct usage and limitations. Always store:

  • Instant as DATETIME, eg. UpdatedDtm

  • Interval as INTEGER, clearly identifying the Unit in the column name, eg. IntervalSec

  • Period. Depends on conjunct or disjunct.

    • For conjunct, which this requirement is, (4.1) applies: use one DATETIME; the end of the period can be derived from the beginning of the period of the next row.
    • For disjunct periods, yes, you need 2 x DATETIMEs, eg, RentedFrom and a RentedTo with gaps in-between.

(4.3) They mess with the "Temporal Primary Key", which complicates code (in addition to requiring triggers to control the Update Anomaly). I have already delivered a clean (tried and tested) Temporal Primary Key.

(4.4) They mess with dummy values, non-real values, and Nulls for "Now". I do not allow such things in a database. Since I am not storing the duplicated ValidTo, I do not have the problem, there is nothing to solve.

(4.5) One has to wonder why a 528 page "textbook" is available free on the web, in poor PDF form.

(5) I [an User] could quiet happily delete all the LocationHistory rows for instance, (leaving only the current version in the Location table) - even though there may exist a SensorHistory row that conceptually "belongs" to a previous version of the Location, if that makes sense.

It does not make sense to me, there is still a gap in the communication we have to close. Please keep interacting until it is closed.

  • In a real (standard ISO/IEC/ANSI SQL) database, we do not GRANT INSERT/UPDATE/DELETE permission to users. We GRANT SELECT and REFERENCES only (to chosen users) All INSERT/UPDATE/DELETEs are coded in Transactions, which means stored procs. Then we GRANT EXEC on each stored proc to selected users (use ROLES to reduce administration).

    • Therefore no one can delete from any table without executing a proc.

    • Do not write a proc to delete from any History table. These rows should not be deleted. In this case, the non-permission and the non-existence of code is the Constraint.

    • Technically, all History rows are valid, there is no Period to concern yourself with. The oldest LocationHistory row contains the before-image of the original Location row before it was changed. The youngest LocationHistory rows is the before-image of the current Location row. Every LocationHistory row in-between is thusly valid and applies to the Period in-between.

    • No need to "prune" or look for a few LocationHistory rows that can be deleted on the basis that they apply to a Period that is not used: they are all used. (Definitively, without the need for checking for any mapping of Location children to any LocationHistory row(s), to prove it.)

    • Bottom line: an User cannot delete from any History (or Transaction) table.

    • Or do you mean something different again ?

    • Note I have added (1.1) above.

(6) Corrected one mistake in the DM. An Alert is an expression of Reading, not Sensor.

(7) Corrected the Business Rules in the other question/answer to reflect that; and the new rules exposed in this question.

(8) Do you understand/appreciate, that since we have a fully IDEF1X compliant model, re Identifiers:

  • The Identifiers are carried through the entire database, retaining their power. Eg. when listing Acknowledgements, they can be joined directly with Location and Sensor; the tables in-between do not have to be read (and they must be if Id keys are used). This is why there are in facts less joins required in a Relational Database (and more joins required in a unnormalised one).

  • the Subtypes, etc need to be navigated only when that particular context is relevant.

Object Technolgy / Object-Oriented Technology implemented in Relational Databases

5 votes

Hi, I am looking for some books/artices/blog post/discussions about "Object Techonolgy/Object-Oriented Technology implemented in Relational Database". I would like to have some more overview about history and how things are now (mainly MS SQL, DB2, etc).

Tank you.

This is my favourite article on Object/Relational Mapping. Valid when it was written, in 2006.

The Vietnam of Computer Science