Best database questions in March 2011

Working with Cross Context Joins in LINQ-to-SQL

9 votes

Initially I had written this query using LINQ-to-SQL

var result = from w in PatternDataContext.Windows
    join cf in PatternDataContext.ControlFocus on w.WindowId equals cf.WindowId
    join p in PatternDataContext.Patterns on cf.CFId equals p.CFId
    join r in ResultDataContext.Results on p.PatternId equals r.PatternId
    join fi in ResultDataContext.IclFileInfos on r.IclFileId equals fi.IclFileId
    join sp in sessionProfileDataContext.ServerProfiles on fi.ServerProfileId equals sp.ProfileId
    join u in infrastructure.Users on sp.UserId equals u.Id
    where w.Process.Equals(processName)
    select u.DistributedAppId;

When I executed it and inspected the result in QuickWatch, it showed this message:

the query contains references to items defined on a different data context

After some googling, I found a related topic here on Stack Overflow about simulating cross-context joins. As suggested there, I changed my query to this:

var result = from w in PatternDataContext.Windows
    join cf in PatternDataContext.ControlFocus on w.WindowId equals cf.WindowId
    join p in PatternDataContext.Patterns on cf.CFId equals p.CFId
    join r in SimulateJoinResults() on p.PatternId equals r.PatternId
    join fi in SimulateJoinIclFileInfos() on r.IclFileId equals fi.IclFileId
    join sp in SimulateJoinServerProfiles() on fi.ServerProfileId equals sp.ProfileId
    join u in SimulateJoinUsers() on sp.UserId equals u.Id
    where w.Process.Equals(processName)
    select u.DistributedAppId;

This query is using these SimulateXyz methods:

private static IQueryable<Result> SimulateJoinResults()
{
  return from r in SessionDataProvider.Instance.ResultDataContext.Results select r;
}
private static IQueryable<IclFileInfo> SimulateJoinIclFileInfos()
{
  return from f in SessionDataProvider.Instance.ResultDataContext.IclFileInfos select f;
}
private static IQueryable<ServerProfile> SimulateJoinServerProfiles()
{
  return from sp in sessionProfileDataContext.ServerProfiles select sp;
}
private static IQueryable<User> SimulateJoinUsers()
{
  return from u in infrastructureDataContext.Users select u;
}

But even this approach didn't solve the problem. I'm still getting this message in QuickWatch:

the query contains references to items defined on a different data context

Is there a solution for this problem? Along with the solution, I would also like to know why the problem still exists and how exactly the new solution removes it, so that next time I can solve such problems myself. I'm new to LINQ, by the way.

I've had to do this before, and there are two ways to do it.

The first is to move all the servers into a single context. You do this by pointing LINQ-to-SQL to a single server and then, on that server, creating linked servers to all the other servers. Then you create views for any tables you're interested in from the other servers, and add those views to your context.

The second is to manually do the joins yourself, by pulling in data from one context, and using just the properties you need to join into another context. For example,

int[] patternIds = SessionDataProvider.Instance.ResultDataContext.Results.Select(o => o.PatternId).ToArray();
var results = from p in PatternDataContext.Patterns
              where patternIds.Contains(p.PatternId)
              select p;

Though the first is easier to work with, it does have its share of problems. The problem is that you're relying on SQL Server to be performant with linked servers, something it is notoriously bad at. For example, consider this query:

var results = from p in DataContext.Patterns
              join r in DataContext.LinkedServerResults on p.PatternId equals r.PatternId
              where r.userId == 10
              select p;

When you enumerate this query, the following will occur (let's call the normal and linked servers MyServer and MyLinkedServer, respectively):

  1. MyServer asks MyLinkedServer for the Results
  2. MyLinkedServer sends the Results back to MyServer
  3. MyServer takes those Results, joins them on the Patterns table, and returns only the ones with Results.userId = 10

So now the question is: when is the filtering done - on MyServer or on MyLinkedServer? In my experience, for such a simple query, it will usually be done on MyLinkedServer. However, once the query gets more complicated, you'll suddenly find that MyServer is requesting the entire Results table from MyLinkedServer and doing the filtering after the join! This wastes bandwidth and, if the Results table is large enough, could turn a 50 ms query into a 50-second query!

You could fix slow cross-server joins using stored procedures, but if you do a lot of complex cross-server joins, you may end up writing stored procedures for most of your queries, which is a lot of work and defeats part of the purpose of using LINQ-to-SQL in the first place (not having to write a lot of SQL).

In comparison, the following code would always perform the filtering on the server containing the Results table:

int[] patternIds = (from r in SessionDataProvider.Instance.ResultDataContext.Results
                    where r.userId == 10
                    select r.PatternId).ToArray();
var results = from p in PatternDataContext.Patterns
              where patternIds.Contains(p.PatternId)
              select p;

Which is best for your situation is up to your best judgement.
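
The second approach is not specific to LINQ, so here is a rough, language-neutral sketch of it in Python. Two separate in-memory SQLite connections stand in for the two data contexts. The table layouts, the sample rows, and the helper name `patterns_for_user` are all assumptions made up for the illustration, not the asker's real schema.

```python
import sqlite3

# Two independent connections stand in for the two data contexts/servers.
result_db = sqlite3.connect(":memory:")
pattern_db = sqlite3.connect(":memory:")

result_db.execute(
    "CREATE TABLE Results (ResultId INTEGER PRIMARY KEY, PatternId INTEGER, UserId INTEGER)"
)
result_db.executemany(
    "INSERT INTO Results (PatternId, UserId) VALUES (?, ?)",
    [(1, 10), (2, 10), (3, 99), (2, 10)],
)

pattern_db.execute("CREATE TABLE Patterns (PatternId INTEGER PRIMARY KEY, Name TEXT)")
pattern_db.executemany(
    "INSERT INTO Patterns (PatternId, Name) VALUES (?, ?)",
    [(1, "alpha"), (2, "beta"), (3, "gamma")],
)


def patterns_for_user(user_id):
    """Filter on the Results side first, then look up Patterns by key."""
    # Step 1: the filter runs where Results lives; only the keys come back.
    rows = result_db.execute(
        "SELECT DISTINCT PatternId FROM Results WHERE UserId = ?", (user_id,)
    ).fetchall()
    pattern_ids = [row[0] for row in rows]
    if not pattern_ids:
        return []  # an empty IN () list is not portable SQL, so stop early

    # Step 2: the equivalent of patternIds.Contains(p.PatternId).
    placeholders = ", ".join("?" for _ in pattern_ids)
    query = (
        "SELECT PatternId, Name FROM Patterns "
        f"WHERE PatternId IN ({placeholders}) ORDER BY PatternId"
    )
    return pattern_db.execute(query, pattern_ids).fetchall()
```

Because the ID list travels to the second database as query parameters, this works best when the filtered list is reasonably small.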


Note that there is a third potential solution which I did not mention, as it is not really a programmer-solution: you could ask your server admins to set up a replication task to copy the necessary data from MyLinkedServer to MyServer once a day/week/month. This is only an option if:

  • Your program can work with slightly stale data from MyLinkedServer
  • You only need to read, never write, to MyLinkedServer
  • The tables you need from MyLinkedServer are not exorbitantly huge
  • You have the space/bandwidth available
  • Your database admins are not stingy/lazy

What is the best way to keep this schema clear?

8 votes

Currently I'm working on an RFID project where each tag is attached to an object. An object could be a person, a computer, a pencil, a box, or whatever comes to my boss's mind. And of course each object has different attributes.

So I'm trying to have a table, tags, where I can keep a register of each tag in the system (registration of the tag), and other tables where I can relate a tag to an object and describe some other attributes. This is what I have done. (Not the real schema, just a simplified version.)

[Image: simplified schema diagram]

Suddenly, I realized that this schema could have the same tag in several tables. For example, the tag 123 could be in C and B at the same time, which is impossible because each tag can be attached to only a single object.

To put it simply, I want each tag to appear no more than once in the database.

My current approach: [image]

What I really want: [image]

Update: Yeah, the TagID is chosen by the end user. Moreover the TagID is given by a Tag Reader and the TagID is a 128-bit number.

New Update: The objects until now are:

-- Medicament(TagID, comercial_name, generic_name, amount, ...)

-- Machine(TagID, name, description, model, manufacturer, ...)

-- Patient(TagID, firstName, lastName, birthday, ...)

All the attributes (columns or whatever you name it) are very different.

Update after update

I'm working on a system with RFID tags for a hospital. Each RFID tag is attached to an object in order to keep watch on it, and unfortunately each object has a lot of different attributes.

An object could be a person, a machine or a medicine, or maybe a new object with other attributes.

So, I just want a flexible and clever schema that allows me to introduce new object types and also lets me easily add new attributes to an object, keeping in mind that this system could be very large.

Examples:

Tag(TagID)
Medicine(generic_name, comercial_name, expiration_date, dose, price, laboratory, ...)
Machine(model, name, description, price, buy_date, ...)
Patient(PatientID, first_name, last_name, birthday, ...)

We must relate just one tag for just one object.

Note: I don't really speak (or write) English well :P Sorry for that. Not a native speaker here.

You can enforce these rules using relational constraints. Check out the use of a persisted computed column to enforce the constraint Tag:{Pencil or Computer}. This model gives you great flexibility to model each child table (Person, Machine, Pencil, etc.) and at the same time prevents any conflicts between tags. It is also good that we don't have to resort to triggers or UDFs via check constraints to enforce the relation. The relation is built into the model.


create table dbo.TagType (TagTypeID int primary key, TagTypeName varchar(10));
insert into dbo.TagType
    values(1, 'Computer'), (2, 'Pencil');

create table dbo.Tag
(   TagId       int primary key, 
    TagTypeId   int references TagType(TagTypeId), 
    TagName     varchar(10),
    TagDate     datetime,
    constraint UX_Tag unique (TagId, TagTypeId)
)
go
create table dbo.Computer 
(   TagId       int primary key, 
    TagTypeID   as 1 persisted,
    CPUType     varchar(25),
    CPUSpeed    varchar(25), 
    foreign key (TagId, TagTypeID) references Tag(TagId, TagTypeID)
)
go
create table dbo.Pencil 
(   TagId       int primary key, 
    TagTypeId   as 2 persisted,
    isSharp     bit,
    Color       varchar(25),
    foreign key (TagId, TagTypeID) references Tag(TagId, TagTypeId)
)
go



-----------------------------------------------------------
-- create a new tag of type Pencil:
-----------------------------------------------------------
insert into dbo.Tag(TagId, TagTypeId, TagName, TagDate)
    values(1, 2, 'Tag1', getdate());

insert into dbo.Pencil(TagId, isSharp, Color)
    values(1, 1, 'Yellow');

-----------------------------------------------------------
-- try to make it a Computer too (fails FK)
-----------------------------------------------------------
insert into dbo.Computer(TagId, CPUType, CPUSpeed)
    values(1, 'Intel', '2.66ghz')
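
For readers without SQL Server at hand, the same idea can be tried out in SQLite from Python. This is only a sketch under assumptions: SQLite has no `persisted` computed column in the T-SQL sense, so the fixed type column is swapped for a `DEFAULT` plus a `CHECK` constraint, the column lists are shortened, and the helper `try_insert` is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE TagType (TagTypeId INTEGER PRIMARY KEY, TagTypeName TEXT);
INSERT INTO TagType VALUES (1, 'Computer'), (2, 'Pencil');

CREATE TABLE Tag (
    TagId     INTEGER PRIMARY KEY,
    TagTypeId INTEGER NOT NULL REFERENCES TagType (TagTypeId),
    UNIQUE (TagId, TagTypeId)
);
CREATE TABLE Computer (
    TagId     INTEGER PRIMARY KEY,
    TagTypeId INTEGER NOT NULL DEFAULT 1 CHECK (TagTypeId = 1),
    CPUType   TEXT,
    FOREIGN KEY (TagId, TagTypeId) REFERENCES Tag (TagId, TagTypeId)
);
CREATE TABLE Pencil (
    TagId     INTEGER PRIMARY KEY,
    TagTypeId INTEGER NOT NULL DEFAULT 2 CHECK (TagTypeId = 2),
    Color     TEXT,
    FOREIGN KEY (TagId, TagTypeId) REFERENCES Tag (TagId, TagTypeId)
);
""")


def try_insert(sql, params):
    """Run an insert and report whether the constraints accepted it."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


# Register tag 1 as a Pencil, then try to attach it to both kinds of object.
conn.execute("INSERT INTO Tag (TagId, TagTypeId) VALUES (1, 2)")
pencil_ok = try_insert("INSERT INTO Pencil (TagId, Color) VALUES (?, ?)", (1, "Yellow"))
computer_ok = try_insert("INSERT INTO Computer (TagId, CPUType) VALUES (?, ?)", (1, "Intel"))
```

The Computer insert is rejected because the pair (1, 1) does not exist in Tag; the CHECK constraint keeps anyone from working around that by writing a different type value into the child row.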

What type of Join to use?

7 votes

I've got a core table and 3 tables that extend the 'core' table in different ways.

I'm working with MLS data and I have a 'common' table that contains information common to all MLS listings, and then a table that has specifically "residential" information, one for "commercial", etc. I have been using the MLS number to join a single table when the property type is known, but for searching I want to join all of them and have the special fields available as search criteria (not simply searching the common table).

What type of join will give me a dataset that will contain all listings (including the extended fields in the idx tables) ?

For each Common table record there is a single corresponding record in ONLY ONE of the idx tables.

                     ___________
                    |           |
                    |  COMMON   |
                    |           |
                    |___________|
                         _|_   
                          |
       ___________________|_____________________
     _|_                 _|_                   _|_
 _____|_____         _____|______           ____|______ 
|           |       |            |         |           |
|   IDX1    |       |   IDX2     |         |   IDX3    |
|           |       |            |         |           |
|___________|       |____________|         |___________|

If you want everything in one row, you can use something like this format. Basically it gives you all the "Common" fields, then the other fields if there is a match otherwise NULL:

SELECT  Common.*,
        Idx1.*,
        Idx2.*,
        Idx3.*
FROM Common
LEFT JOIN Idx1
    ON Idx1.MLSKey = Common.MLSKey
LEFT JOIN Idx2
    ON Idx2.MLSKey = Common.MLSKey  
LEFT JOIN Idx3
    ON Idx3.MLSKey = Common.MLSKey

Bear in mind it's better to list out fields than to use the SELECT * whenever possible...

Also I'm assuming MySQL syntax is the same as SQL Server, which is what I use.
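
To see the shape of the result, here is a small sketch that runs the same kind of query against SQLite from Python, with the columns listed out instead of `SELECT *`. The column names (Price, Bedrooms, ZoningCode, Acres) and the sample rows are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Common (MLSKey INTEGER PRIMARY KEY, Price INTEGER);
CREATE TABLE Idx1 (MLSKey INTEGER PRIMARY KEY, Bedrooms INTEGER);
CREATE TABLE Idx2 (MLSKey INTEGER PRIMARY KEY, ZoningCode TEXT);
CREATE TABLE Idx3 (MLSKey INTEGER PRIMARY KEY, Acres REAL);

INSERT INTO Common VALUES (100, 250000), (200, 900000), (300, 75000);
INSERT INTO Idx1 VALUES (100, 3);
INSERT INTO Idx2 VALUES (200, 'C-2');
INSERT INTO Idx3 VALUES (300, 12.5);
""")

# Every Common row survives; extension columns are NULL where nothing matches.
JOINED = """
    SELECT c.MLSKey, c.Price, i1.Bedrooms, i2.ZoningCode, i3.Acres
    FROM Common AS c
    LEFT JOIN Idx1 AS i1 ON i1.MLSKey = c.MLSKey
    LEFT JOIN Idx2 AS i2 ON i2.MLSKey = c.MLSKey
    LEFT JOIN Idx3 AS i3 ON i3.MLSKey = c.MLSKey
"""

all_listings = conn.execute(JOINED + " ORDER BY c.MLSKey").fetchall()

# Searching on an extension field uses the same joined shape.
big_homes = conn.execute(
    JOINED + " WHERE i1.Bedrooms >= ? ORDER BY c.MLSKey", (3,)
).fetchall()
```

Each listing comes back as one row, with `None` (NULL) in the columns that belong to the extension tables it has no record in.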

Web Development - Object db vs Relational db

7 votes

What are the pros and cons of using an object database or a relational database for regular web development that involves a lot of CRUD?

UPDATE: I reopened the bounty in order to award it to Neville.

Relational database:

Pros:

  • Established technology - lots of tools, developers, resources
  • Wide range of Open Source and commercial products
  • Known to scale to very large sites, and very high throughput
  • Expresses many problem domains in a logical and "programmable" way
  • Fairly standard language (SQL)

Cons:

  • Impedance mismatch with OO concepts - modeling "inheritance" in a database is not natural
  • Hierarchical structures usually require vendor-specific extensions to the language
  • Non-relational data (e.g. documents) are not a natural fit
  • Changes in the business domain can be hard to implement once the schema has been defined

OODBMS:

Pros:

  • Closer fit for OO concepts
  • In theory, a developer only needs to work in one language - the persistence details are abstracted away. This should improve productivity

Cons:

  • Significantly fewer tools/resources/developers available.
  • No widely accepted standards
  • "black box" approach to persistence can make performance tuning difficult
  • persistence details often leak into the OO design (see Marcelo's examples)

Is ESQL used in industry?

6 votes

I'm taking a database course and I have to write a command-line application. The prof wants us to write an ESQL (embedded SQL) application.

I have a feeling that this kind of technology is deprecated.

We have to use the Oracle precompiler to translate ESQL code into C++. This kind of application looks terrible to maintain.

A PHP application would also work well, but they probably want a command-line application so they can do the grading faster (unit tests with input feeds). What do you think: is embedded SQL used in the industry, and is it worth asking the prof to let us write a Java application? Is there another, more appropriate technology?

Embedded SQL was one of the most popular ways to do SQL in C during the "old days" (C++ was not yet invented).

These days mostly we'll be using an ORM library. It is not recommended to do embedded SQL any more because, as you put it well, it depends on a proprietary pre-processor and makes code difficult to debug, manage, and maintain. It also hooks you to one single database vendor and your code will be extremely difficult to move to another database backend. Generally, we don't do it in "real life".

But as it is only a class, your prof is probably interested in teaching you SQL and database concepts. Embedded SQL is only a tool. You're supposed to learn SQL and databases, not embedded SQL in C++.

However, I believe that you're missing the point by asking about PHP and Java. Not to mention that PHP is a scripting language, and Java is just another language for which you could (potentially) write an embedded SQL preprocessor.

So your point about embedded SQL really has nothing to do with language choice. It has to do with the trade-offs between (1) a proprietary embedded system with a preprocessor and (2) an ORM library or a data-access library (e.g. ODBC).

Off-Topic:

I first started using embedded SQL when I was in College (that was about 30 years ago!). Actually got programming jobs out of College and still used it, but obviously it was on the way out. Never seen it used ever since 1990 or so.

Why not DbConnection instead of SqlConnection or OracleConnection?

5 votes

I'm a Java retread pretty new to C#. I'm hoping to stay out of trouble when I crank out a bunch of DML code in the next few weeks.

I'm used to the idea of using JDBC's abstract classes like Connection, Statement, and the like. C# offers similar abstract classes like DbConnection, DbCommand, and so forth, in the System.Data.Common namespace.

But most of the examples I've seen -- both in MS documentation and other books -- use the concrete classes: SqlConnection, OracleCommand, etc. This kind of concreteness even shows up in the MySQL documentation.

What is the best practice in this area? Is there some strong reason to choose concrete table-server-specific rather than abstract classes for this purpose? (I'm aware of the hazards of downcasting abstract to concrete, of course).

The abstract classes were not part of the first versions of the framework; they were introduced in version 2.0. A lot of examples were written before that, or are based on examples that were written before that.

Using concrete or abstract classes is mostly a matter of taste. It's a nice idea to write code that could work with any database, but my experience is that you don't switch database systems very often, and if you do there are so many changes that you need to do that it doesn't matter much if you used abstract classes or not.

What concepts should I study to accomplish this?

5 votes

I am such a newbie to C# that I have to ask questions before I can get started!

What I want to do is enter a number into a text box, send it to an attached SQL Compact database, and check whether the number is in the table. If it is, return the data to the form. If not, I want to run some code that will get the information, add it to the table, and send it to the form. Other than creating SQL tables via C#, could someone help me prototype this concept, so to speak, so I can start reading up on the concepts and begin building this portion of my project? Thanks.

There are many different ways to do what you are describing. A quick and easy way to handle this scenario would be to use WPF for the user interface and LINQ to SQL for the database access. There are tons of tutorials on both technologies, but here are the basic steps:

Step 1: Create a new WPF project in Visual Studio
Step 2: Add a LINQ to SQL class and map it to your Database
Step 3: Edit the MainWindow.xaml and add the input textbox, check button, and results textbox

Sample code for MainWindow.xaml (note this is quick and dirty):

<Window x:Class="WPFPlayground.MainWindow"
    xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
    xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
    Title="MainWindow" Height="350" Width="525">
    <Grid>
        <StackPanel Orientation="Horizontal" Height="30">
            <TextBox Name="InputTextBox" Width="50"/>
            <Button Name="CheckButton" Content="Check DB" Click="CheckButton_Click"/> 
            <TextBox Name="ResultsTextBox" Width="100"/>
        </StackPanel>
    </Grid>
</Window>

Step 4: Edit the code behind of MainWindow.xaml.cs to handle the button click event

Sample code for Click event in MainWindow.xaml.cs (again quick and dirty)

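private void CheckButton_Click(object sender, RoutedEventArgs e)
{
    // The text box holds a string, so parse it before comparing it with the key
    // (this assumes Id is an integer column)
    int id;
    if (!int.TryParse(InputTextBox.Text, out id))
    {
        ResultsTextBox.Text = "Please enter a number";
        return;
    }

    // Get instance of my LINQ to SQL datacontext and dispose of it when done
    using (var db = new MyDbDataContext())
    {
        // Try to get the record that matches the number from the InputTextBox
        var data = db.TableName.FirstOrDefault(r => r.Id == id);

        // Was able to find a matching record so show results data
        if (data != null)
        {
            ResultsTextBox.Text = data.EventDesc;
        }
        else
        {
            // do whatever you need to do when there is no match
        }
    }
}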

Step 5: Learn some best practices and do not use this sample code :)

Have fun.
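
The "look it up, and add it if it is missing" flow from the question is independent of WPF and LINQ to SQL. A minimal sketch of just that logic, written in Python against SQLite, might look like the following. The table name `Items`, the function names, and `fetch_description` (a stand-in for whatever code "gets the information") are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Items (Id INTEGER PRIMARY KEY, EventDesc TEXT NOT NULL)")
conn.execute("INSERT INTO Items (Id, EventDesc) VALUES (1, 'already stored')")


def fetch_description(item_id):
    """Hypothetical stand-in for the code that gets the information on a miss."""
    return f"generated for {item_id}"


def lookup_or_add(raw_text):
    """Return the stored description, adding a row first if none exists."""
    try:
        item_id = int(raw_text.strip())
    except ValueError:
        return None  # not a number, so there is nothing to look up

    row = conn.execute(
        "SELECT EventDesc FROM Items WHERE Id = ?", (item_id,)
    ).fetchone()
    if row is not None:
        return row[0]  # found: hand the data back to the form

    # Not found: get the information, store it, then return it.
    description = fetch_description(item_id)
    with conn:
        conn.execute(
            "INSERT INTO Items (Id, EventDesc) VALUES (?, ?)", (item_id, description)
        )
    return description
```

The concepts to read up on are the same in C#: validating the input, a parameterized lookup by key, and an insert on a miss.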

Does it make sense to use an OR-Mapper?

5 votes

Does it make sense to use an OR-mapper?

I am putting this question out there on Stack Overflow because this is the best place I know of to find smart developers willing to give their assistance and opinions.

My reasoning is as follows:

1.) Where does the SQL belong?

a.) In every professional project I have worked on, security of the data has been a key requirement. Stored Procedures provide a natural gateway for controlling access and auditing.

b.) Issues with Applications in production can often be resolved between the tables and stored procedures without putting out new builds.

2.) How do I control the SQL that is generated? I am trusting parse trees to generate efficient SQL. I have quite a bit of experience optimizing SQL in SQL-Server and Oracle, but would not feel cheated if I never had to do it again. :)

3.) What is the point of using an OR-Mapper if I am getting my data from stored procedures?

I have used the repository pattern with a homegrown generic data access layer. If a collection needed to be cached, I cached it. I also have experience using EF on a small CRUD application and experience helping tune an NHibernate application that was experiencing performance issues. So I am a little biased, but willing to learn.

For the past several years we have all been hearing a lot of respectable developers advocating the use of specific OR-Mappers (Entity-Framework, NHibernate, etc...).

Can anyone tell me why someone should move to an ORM for mainstream development on a major project?

edit: http://www.codinghorror.com/blog/2006/06/object-relational-mapping-is-the-vietnam-of-computer-science.html seems to have a strong discussion on this topic but it is out of date.

Yet another edit: Everyone seems to agree that Stored Procedures are to be used for heavy-duty enterprise applications, due to their performance advantage and their ability to add programming logic nearer to the data.

I am seeing that the strongest argument in favor of OR mappers is developer productivity.

I suspect a large motivator for the ORM movement is developer preference towards remaining persistence-agnostic (don’t care if the data is in memory [unless caching] or on the database).

ORMs seem to be outstanding time-savers for local and small web applications.

Maybe the best advice I am seeing is from client09: to use an ORM setup, but use Stored Procedures for the database intensive stuff (AKA when the ORM appears to be insufficient).

I was pro-SP for many, many years and thought it was the ONLY right way to do DB development, but I completed the last 3-4 projects in EF 4.0 without SPs, and the improvements in my productivity have been truly awe-inspiring - I can do things in a few lines of code now that would have taken me a day before.

I still think SPs are important for some things (there are times when you can significantly improve performance with a well-chosen SP), but for general CRUD operations, I can't imagine ever going back.

So the short answer for me is, developer productivity is the reason to use the ORM - once you get over the learning curve anyway.

implementing Database-functionality objects in an N-Tier architecture?

5 votes

I'm adding functionality to our website which performs long-running processes asynchronously using MSMQ. Doing this asynchronously, however, means we need to notify users when their requests are completed. Using the command pattern, I created an interface called INotify and composed that into the message class, so the message processing class can simply call GiveNotice() on the message's INotify object. The first implementation, EmailNotify, was more difficult than expected, as I was surprised to discover MailMessage isn't serializable, but I got it going.

Now I'm working on a new concrete notifier, DBNotify, which will call a SP of some sort and update a status in the main transactional database. I'm tripped up in that I would like to reuse the DAL architecture we've already created, but INotify is a member of the Model project, which is more fundamental than the DAL.

Our hierarchy looks like this: Common > Model > DAL > BAL

Here's more detail about the tiers. Bear in mind, I inherited this from :

Common is responsible for all "utility" functions which are used in many places in the application: things like accessing configuration settings, parsing strings, and other non-business-related functionality.

Model contains the business objects, what some folks call data transfer objects: collections of getters and setters. I've added some "smarts" at this layer, but only business rules internal to that object, such as "An item's name must begin with an alphanumeric character."

DAL is the data access layer; in theory, all that happens here is that model objects are moved into and out of the database.

BAL is the business layer; in theory, business rules that govern the interaction of objects are enforced here (e.g. "A form must have at least two items.").

So the INotify interface is defined as an abstraction to allow the method of notification to vary independently (e.g. email, TXT, Twitter, etc.). It's fundamental to the system, so I have created it at the Model tier, which is independent of the DAL tier. However, I am creating a new concrete implementation of INotify whose notification method is to call a SP in a database.

Has anyone else dealt with a business object whose purpose is to interact with a database, and how do you situate that in your N-tier architecture?

Before you tell me to use Linq to Sql, great thanks. This is not a technical question (how do I do this), it's a design question (how should I do this).

I think there is a StackExchange site more focused on these sorts of language-independent design questions, so I'm going to copy this there.

Thank you everyone for your input, there are several ideas here for improvements I plan to implement, although none directly answer the question I was asking.

I cross-posted this over to Programmers, where I think this sort of question may truly belong, and got some helpful ideas. If you're interested, the thread is here: Programmers thread on this issue. Admittedly, I added the "hint" of dependency injection based on my own research when I posted there, so the problem may have been clearer.

This is a great and helpful community, which I am so proud to participate in.

When don't we need a primary key for our table?

5 votes

Will it ever happen that we design a table that doesn't need a primary key?

No.

The primary key does a lot of stuff behind-the-scenes, even if your application never uses it.

For example: clustering improves efficiency (because heap tables are a mess).

Not to mention, if ANYONE ever has to do something on your table that requires pulling a specific row and you don't have a primary key, you are the bad guy.

Zend_Session_SaveHandler_DbTable is wiping the Session with every refresh?

4 votes

I'm basically encountering the same problem as the poster in this question. My database is initialized properly. I've tried doing the initialization of both the database and the session SaveHandler in the application.ini and in the Bootstrap. Same result no matter how I do it.

Here's what the application.ini initialization looks like:

resources.db.adapter = "pdo_mysql"
resources.db.params.host = "localhost"
resources.db.params.username = "uname"
resources.db.params.password = "******"
resources.db.params.dbname = "dbname"

resources.session.saveHandler.class = "Zend_Session_SaveHandler_DbTable"
resources.session.saveHandler.options.name = "sessions"
resources.session.saveHandler.options.primary = "sessionID"
resources.session.saveHandler.options.modifiedColumn = "lastModifiedTime"
resources.session.saveHandler.options.dataColumn = "data"
resources.session.saveHandler.options.lifetimeColumn = "lifetime"

And here's what the Bootstrap initialization looked like:

protected function _initSession() {
    $db = Zend_Db::factory('Pdo_Mysql', array(
        'host'     => 'localhost',
        'username' => 'uname',
        'password' => '******',
        'dbname'   => 'dbname'
    ));
    Zend_Db_Table_Abstract::setDefaultAdapter($db);

    $sessionConfig = array(
        'name'           => 'sessions',
        'primary'        => 'sessionID',
        'modifiedColumn' => 'lastModifiedTime',
        'dataColumn'     => 'data',
        'lifetimeColumn' => 'lifetime'
    );
    $saveHandler = new Zend_Session_SaveHandler_DbTable($sessionConfig);
    Zend_Session::setSaveHandler($saveHandler);
    Zend_Session::start();
}

My sessions database table is defined as follows:

create table sessions (
    sessionID char(32) primary key not null, 
    lastModifiedTime timestamp, 
    lifetime timestamp, 
    data text
) engine=innodb;

I have a test action that tests this through a very simple one field form that just dumps its contents into the Session. The action looks like this:

public function addAction()
{
    $namespace = new Zend_Session_Namespace();

    $form = new Application_Form_AddToSession();
    $request = $this->getRequest();
    if ($request->isPost()) {
        if ($form->isValid($request->getPost())) {
           $namespace->content = $request->getParam('toAdd');
        }
    }
    $this->view->form = $form; 
}

Here's the form it uses:

class Application_Form_AddToSession extends Zend_Form
{

    public function init()
    {
        $this->setMethod('post');

        $this->addElement('text', 'toAdd', array(
            'filters'    => array('StringTrim', 'StringToLower'),
            'validators' => array(
                array('StringLength', false, array(0, 256)),
            ),
            'required'   => true,
            'label'      => 'Add:',
        ));

        $this->addElement('submit', 'add', array(
            'required' => false,
            'ignore'   => true,
            'label'    => 'Add',
        )); 
    }


}

The view just shows the form.

To test whether or not the value actually went into the session, I use the index action. This is the index action in question:

public function indexAction()
{
    $namespace = new Zend_Session_Namespace();
    echo 'Content: '.$namespace->content.'<br>';
    echo '<pre>'; print_r($_SESSION); echo '</pre>';
}

Now. If I don't have session saving configured to use Zend_Session_SaveHandler_DbTable, i.e., if I don't have session saving configured at all, then this works fine. I enter a value in the form field, go to the index action and have it output back to me. The session works exactly the way it is supposed to.

If I have Zend_Session_SaveHandler_DbTable configured in either the application.ini or the Bootstrap, then when I enter a value into the test field and go to the index action the value is gone. My database table has a row with the proper sessionID and the sessionID matches a cookie in my browser. But there is no other information in the database. data is NULL and both the TIMESTAMP fields are zeroed out.

I've run out of things to try. I've had the MySQL table as a regular table and an InnoDB table. I've tried every permutation of the database and session configuration I can come up with, including giving the db to the configuration array, and initializing one in the Bootstrap and the other in the .ini. I've scoured the web and Stack Overflow for clues. I've seen other people post about similar problems, but none of the answers I've found have worked. What haven't I done? What have I screwed up? How can I make it work?

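The problem is that you defined the lastModifiedTime and lifetime columns as TIMESTAMP. Zend_Session_SaveHandler_DbTable stores integer values in them (a Unix time and a lifetime in seconds), so they should be INT instead: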

CREATE TABLE  `sessions` (
`sessionID` char(32) NOT NULL,
`lastModifiedTime` INT,
`lifetime` INT,
`data` text,
PRIMARY KEY (`sessionID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

After this small modification it should work.
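
To make it clearer why integer columns matter, here is a rough Python/SQLite sketch of the bookkeeping a database-backed session handler does. It assumes, as I understand Zend_Session_SaveHandler_DbTable to work, that a session counts as live while lastModifiedTime + lifetime is later than the current Unix time. The function names and the default lifetime are made up for the example, and this is not the Zend code itself.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sessions (
        sessionID        TEXT PRIMARY KEY,
        lastModifiedTime INTEGER,
        lifetime         INTEGER,
        data             TEXT
    )
""")


def write_session(session_id, data, lifetime=1440, now=None):
    """Store the payload together with the current Unix time (an integer)."""
    now = int(time.time()) if now is None else now
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO sessions VALUES (?, ?, ?, ?)",
            (session_id, now, lifetime, data),
        )


def read_session(session_id, now=None):
    """Return the payload only while modified + lifetime is still in the future."""
    now = int(time.time()) if now is None else now
    row = conn.execute(
        "SELECT lastModifiedTime, lifetime, data FROM sessions WHERE sessionID = ?",
        (session_id,),
    ).fetchone()
    if row is None:
        return ""
    modified, lifetime, data = row
    return data if modified + lifetime > now else ""
```

If those two columns cannot hold plain integers, the arithmetic in `read_session` has nothing sensible to work with, which would be consistent with the zeroed-out TIMESTAMP fields and the empty session described in the question.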

Does the database (maximum) field length affect performance?

4 votes

At my company, we have a legacy database with various tables and therefore many, many fields. A lot of the fields seem to have large limits (ex: NVARCHAR(MAX)) that are never reached. Does arbitrarily making the fields their maximum width or 2 to 3 times larger than what is normally inputted negatively affect performance? How should one balance performance with field lengths? Is there a balance?

There's two parts to this question:

Does using NVARCHAR over VARCHAR hurt performance? Yes, storing data in Unicode fields doubles the storage requirements. Your data stored in those fields is 2x the size it needs to be (until SQL Server 2008 R2 came out, which includes Unicode compression). Your table scans will take twice as long and only half as much data can be stored in memory in the buffer cache.

Does using MAX hurt performance? Not directly, but when you use VARCHAR(MAX), NVARCHAR(MAX), and those kinds of fields, and if you need to index them, you won't be able to rebuild those indexes online.
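
The 2x figure for NVARCHAR can be checked with a quick sketch in Python. It assumes NVARCHAR data is stored as two-byte UTF-16 code units and VARCHAR data in a single-byte code page (Latin-1 is used here as a stand-in), and it ignores row overhead, length prefixes and compression. The storage_bytes helper is made up for the example.

```python
def storage_bytes(text):
    """Return (single-byte size, UTF-16 size) of the character data in bytes."""
    single_byte = len(text.encode("latin-1"))    # VARCHAR-style: 1 byte per character
    double_byte = len(text.encode("utf-16-le"))  # NVARCHAR-style: 2 bytes per character
    return single_byte, double_byte

for sample in ["Smith", "42 Main Street", "N/A"]:
    narrow, wide = storage_bytes(sample)
    print(f"{sample!r}: {narrow} bytes vs {wide} bytes")
```

Note that the size depends on the characters actually stored, not on the declared maximum length of the column.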

Must Read/ Favorite Papers in Database and related fields

4 votes

Please recommend what you consider to be the most important or just favorite papers in Database Management, Information Systems, Data Mining etc.

Here are a couple that I think are important milestones:

  1. What Goes Around Comes Around - Michael Stonebraker, Joseph M. Hellerstein

  2. Dynamo: Amazon’s Highly Available Key-value Store - Giuseppe DeCandia, Deniz Hastorun, Madan Jampani et al.

  3. Bigtable: A Distributed Storage System for Structured Data - Fay Chang, Jeffrey Dean et al.

  4. MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat

  5. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services - Seth Gilbert and Nancy Lynch

  6. Parallel database systems: the future of high performance database systems - David DeWitt and Jim Gray

  7. A Formal Model of Crash Recovery in a Distributed System - Dale Skeen and Michael Stonebraker

Since I'm new on stackoverflow, I can't post more than two hyperlinks!! So the version with links is at http://www.reddit.com/r/compsci/comments/ghc0w/please_recommend_must_read_favorite_papers_in/c1no849

The book that zamanbakshi recommends, Transaction Processing: Concepts and Techniques, by Gray and Reuter, is really, really good. I carried it around so much that the cover fell off--and this is a hardcover. It is somewhat outdated on some topics, of course, but it is a much better read than most later books, such as Weikum and Vossen's Transactional Information Systems, which is a good book, but makes my eyes glaze over, I have to admit.

If I recall correctly, the Gray and Reuter text doesn't cover Mohan's repeating histories recovery technique, which is essential. See ARIES/NT: A Recovery Method Based on Write-Ahead Logging for Nested Transactions and ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, at the very least, but most of Mohan's papers are worth reading.

The book Concurrency Control and Recovery in Database Systems by Bernstein, et al., is out of print, but you can download it from his Microsoft Research page.

There are also a lot of good publications there from David Lomet and the late (or missing) Jim Gray.

Some important papers that aren't in the 2nd edition of the Red Book (the edition I have):

  • A critique of ANSI SQL isolation levels (1995) Berenson, Bernstein, Gray, et al.
  • The dangers of replication and a solution (1996) Gray, Helland, et al.
  • Generalized Isolation Level Definitions (2000) Adya, et al.

A recent paper that I think deserves more attention is Serializable Isolation for Snapshot Databases (2009) by Cahill, Röhm, and Fekete. It's a really simple technique that works surprisingly well. I hope that it gets implemented in some DBMSes. While searching for related stuff, I came across this interesting reading list. It mostly has flash memory-related stuff, but there are some general papers of interest, too, including some recent Stonebraker papers.

I recommend skipping Date's "Third Manifesto" stuff. I was quite disappointed by it. I don't think he has ever done any object-oriented programming. His earlier books and articles on relational DBMSes are good, if a bit repetitive.

A good description of a main-memory DBMS is The Architecture of the Dalí Main-Memory Storage Manager. The non-action-consistent, non-WAL checkpointing blew my mind at first.

Here are a couple about non-consistent distributed data management (for very, very big data):

  • BASE: an Acid alternative (2008) Pritchett
  • Life beyond Distributed Transactions: an Apostate's Opinion (2007) Helland

DB design and optimization considerations for a social application

4 votes

The usual case. I have a simple app that will allow people to upload photos and follow other people. As a result, every user will have something like a "wall" or an "activity feed" where he or she sees the latest photos uploaded from his/her friends (people he or she follows).

Most of the functionalities are easy to implement. However, when it comes to this history activity feed, things can easily turn into a mess because of pure performance reasons.

I have come to the following dilemma here: I can easily design the activity feed as a normalized part of the database, which will save me write cycles, but will enormously increase the complexity of selecting those results for each user (for each photo uploaded within a certain time period, select a certain number whose uploaders I am following, or for each person I follow, select his or her photos).

An optimization option could be the introduction of a series of threshold constraints which, for instance would allow me to order the people I follow on the basis of the date of their last upload, even exclude some, to save cycles, and for each user, select only the 5 (for example) last uploaded photos.

The second approach is to introduce a completely denormalized schema for the activity feed, in which every row represents a notification for one of my followers. This means that every time I upload a photo, the DB will put n rows in this "drop bucket", n being the number of people who follow me, i.e. lots of write cycles. If I have such a table, though, I could easily apply some optimization techniques such as clever indexing, as well as pruning entries older than a certain period of time (queue).
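
The second approach (one feed row per follower, written at upload time) could look roughly like the following Python/SQLite sketch. The table names, column names and helper functions are all assumptions made for illustration, and integer timestamps stand in for real dates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE follows (follower_id INTEGER, followee_id INTEGER);
    CREATE TABLE photos  (photo_id INTEGER PRIMARY KEY, uploader_id INTEGER,
                          uploaded_at INTEGER);
    CREATE TABLE feed    (owner_id INTEGER, photo_id INTEGER, uploaded_at INTEGER);
    CREATE INDEX feed_owner_time ON feed (owner_id, uploaded_at);
""")

def upload_photo(uploader_id, uploaded_at):
    """Store the photo, then write one feed row per follower of the uploader."""
    cur = conn.execute(
        "INSERT INTO photos (uploader_id, uploaded_at) VALUES (?, ?)",
        (uploader_id, uploaded_at),
    )
    photo_id = cur.lastrowid
    conn.execute(
        "INSERT INTO feed (owner_id, photo_id, uploaded_at)"
        " SELECT follower_id, ?, ? FROM follows WHERE followee_id = ?",
        (photo_id, uploaded_at, uploader_id),
    )
    return photo_id

def read_feed(owner_id, limit=20):
    """Reading is a single indexed lookup on the denormalized table."""
    rows = conn.execute(
        "SELECT photo_id FROM feed WHERE owner_id = ?"
        " ORDER BY uploaded_at DESC LIMIT ?",
        (owner_id, limit),
    )
    return [row[0] for row in rows]

def prune_feed(older_than):
    """Drop feed entries older than the given timestamp."""
    conn.execute("DELETE FROM feed WHERE uploaded_at < ?", (older_than,))

# Users 2 and 3 follow user 1, who uploads two photos.
conn.executemany("INSERT INTO follows VALUES (?, ?)", [(2, 1), (3, 1)])
first = upload_photo(1, uploaded_at=100)
second = upload_photo(1, uploaded_at=200)
```

Each upload costs one write per follower, while each feed read touches only the owner's own rows, which is the trade-off described above.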

Yet a third approach that comes to mind is a less denormalized schema, where the server-side application takes some of the complexity off the DB. I saw that some social apps, such as FriendFeed, rely heavily on storing serialized objects such as JSON objects in the DB.

I am definitely still mastering the skill of scalable DB design, so I am sure that there are many things I've missed or still have to learn. I would highly appreciate it if someone could at least point me in the right direction.

I would probably start with a normalized schema so that you can write quickly and compactly. Then use non-transactional (no locking) reads to pull the information back out, making sure to use a cursor so that you can process the results as they come back instead of waiting for the entire result set. Since it doesn't sound like the information has any particularly critical implications, you don't really need to worry about a lot of the concerns that would normally push you toward transactional reads.
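
As a rough sketch of that suggestion in Python with SQLite (the schema, names and integer timestamps are assumptions for illustration): the feed is computed with a join at read time, and the rows are consumed through the cursor one at a time. How to relax locking for reads depends on the DBMS and is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE follows (follower_id INTEGER, followee_id INTEGER);
    CREATE TABLE photos  (photo_id INTEGER PRIMARY KEY, uploader_id INTEGER,
                          uploaded_at INTEGER);
    CREATE INDEX photos_uploader_time ON photos (uploader_id, uploaded_at);
""")
# User 1 follows users 2 and 3; user 4 is not followed.
conn.executemany("INSERT INTO follows VALUES (?, ?)", [(1, 2), (1, 3)])
conn.executemany(
    "INSERT INTO photos (uploader_id, uploaded_at) VALUES (?, ?)",
    [(2, 100), (3, 200), (4, 300), (2, 400)],
)

def stream_feed(user_id, since):
    """Yield (photo_id, uploader_id) rows, newest first, one at a time."""
    cursor = conn.execute(
        "SELECT p.photo_id, p.uploader_id"
        " FROM photos AS p"
        " JOIN follows AS f ON f.followee_id = p.uploader_id"
        " WHERE f.follower_id = ? AND p.uploaded_at >= ?"
        " ORDER BY p.uploaded_at DESC",
        (user_id, since),
    )
    for row in cursor:  # rows are pulled from the cursor as the loop advances
        yield row

feed = list(stream_feed(1, since=0))
```

Writes stay cheap (one row per upload), and the cost moves to the join at read time.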

Desktop App w/ Database - How to handle data retrieval?

4 votes

Hello everyone,

Imagine a desktop application - best described as record keeping, where the user inserts/views the records - that relies on a DB back-end which will contain large object hierarchies and properties. How should data retrieval be handled?

Should all the data be loaded at start-up and stored in corresponding Classes/Structures for later manipulation, or should the data be retrieved only when needed, stored in mock-up Classes/Structures and then reused later instead of being requested from the DB again?

As far as I can see, the former approach would use more memory and might add waiting time at start-up (not so bad if a splash screen is displayed), while the latter could subject the user to delays during processing due to data retrieval and would require some expensive queries on the database, whose results and/or supporting data structures will most probably serve no purpose once used*.

Something tells me that the solution lies in an in-depth analysis which will lead to a mixture of the two approaches listed above, based on the data most frequently used, but I am very interested in reading your thoughts, tips and real life experiences on the topic.

For discussion's sake, I'm thinking about C++ and SQLite.

Thanks!

*assuming that operations on Classes/Objects in memory are faster than performing complicated queries on the DB.

EDIT

Some additional details:

  • No concurrent access to the data, meaning only 1 user works on the data which is stored locally.
  • Data is written back only when a human makes changes, i.e. with low frequency. This is not necessarily true for reading data from the DB, where I can expect a few peaks of lots of reads which I'd like to be fast.
  • What I am most afraid of is the user getting the feeling of slowness when displaying a complex record (because this has to be read in from the DB).

Use Lazy Load and Data Mapper (pg.165) patterns.
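
A minimal sketch of how the two patterns can fit together, written in Python with SQLite for brevity (the question mentions C++ and SQLite, and the same structure carries over). The schema and the Record and RecordMapper names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE details (record_id INTEGER, body TEXT);
    INSERT INTO records VALUES (1, 'Invoice 1');
    INSERT INTO details VALUES (1, 'line A'), (1, 'line B');
""")

class Record:
    """Domain object: knows nothing about SQL, only how to ask for its details."""

    def __init__(self, record_id, title, load_details):
        self.id = record_id
        self.title = title
        self._load_details = load_details  # callable supplied by the mapper
        self._details = None               # not loaded yet

    @property
    def details(self):
        # Lazy Load: hit the database on first access only, then reuse the result.
        if self._details is None:
            self._details = self._load_details(self.id)
        return self._details

class RecordMapper:
    """Data Mapper: the only place that knows how records map to tables."""

    def __init__(self, connection):
        self._conn = connection
        self._loaded = {}  # cache so each record is fetched and built once

    def find(self, record_id):
        if record_id not in self._loaded:
            row = self._conn.execute(
                "SELECT id, title FROM records WHERE id = ?", (record_id,)
            ).fetchone()
            if row is None:
                return None
            self._loaded[record_id] = Record(row[0], row[1], self._find_details)
        return self._loaded[record_id]

    def _find_details(self, record_id):
        rows = self._conn.execute(
            "SELECT body FROM details WHERE record_id = ? ORDER BY rowid",
            (record_id,),
        )
        return [row[0] for row in rows]

mapper = RecordMapper(conn)
record = mapper.find(1)  # cheap: only the summary row has been read so far
```

The cheap summary row is read up front and the expensive part of a complex record is read only when it is first displayed, after which it stays in memory.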

Ruby style relational tables in PHP

4 votes

I know Ruby (on Rails) uses lots of "magic", but I use PHP. Is there not a way of implementing the Rails-like:

class Player < ActiveRecord::Base
    has_many :cards
end

in PHP? It would make so many people's lives so much easier. Are there perhaps frameworks/ORMs that provide similar functionality?

The term you're looking for is an "Object Relational Mapper" (ORM). The one you cite is a component of the Rails framework, called ActiveRecord.

PHP ActiveRecord is one project that attempts to provide this, though the last release was in July 2010. I haven't actually used it; I just switched to Rails. :)
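
For anyone curious what a has_many declaration boils down to, here is a small language-neutral sketch in Python with SQLite. It is not how Rails or PHP ActiveRecord are implemented. It only assumes the usual convention that the child table carries a foreign key back to the parent, and the has_many helper here is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cards   (id INTEGER PRIMARY KEY, player_id INTEGER, rank TEXT);
    INSERT INTO players VALUES (1, 'Ann');
    INSERT INTO cards (player_id, rank) VALUES (1, 'ace'), (1, 'king');
""")

def has_many(child_table, foreign_key):
    """Build a read-only property that loads the child rows by foreign key."""
    def getter(self):
        # Table and column names come from the class definition, never from user input.
        sql = f"SELECT * FROM {child_table} WHERE {foreign_key} = ? ORDER BY id"
        return conn.execute(sql, (self.id,)).fetchall()
    return property(getter)

class Player:
    cards = has_many("cards", "player_id")  # one declaration, one generated query

    def __init__(self, player_id, name):
        self.id = player_id
        self.name = name

ann = Player(1, "Ann")
```

An ORM adds naming conventions, object hydration and write support on top, but the core of the "magic" is a generated foreign-key query like this one.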