Best database questions in April 2012

Where to put the database sensitive information

10 votes

Possible Duplicate:
How to secure database passwords in PHP?

Recently I was given a website project that was supposed to be done in PHP, but I don't have much experience with PHP. Anyway, it is up and running, but there is a lot of room for improvement. The one thing I am not comfortable with is the way I am dealing with the database: I put the database connection information in a separate db.php file and include it where needed. But I remember seeing PHP source files returned by the server many a time.

So now my question is: what is a better or the best way / place to put database sensitive data?

By the way, how do I stop PHP from showing error messages on web pages when things go wrong? A custom error page or settings somewhere in php.ini? Thanks!

Note: I am using PHP in its old flavor, not the object-oriented way. But I am open to the object-oriented or MVC way if there are better approaches that way to prepare for future projects.

I don't know if this is what you are looking for:
You can put your sensitive data in your db.php, but outside the web root directory (public_html or www).

For example, you could have a directory called config which is a sibling of your web root directory, and store your db.php file there.

You can include your db.php file like this:

require_once('../config/db.php');

I hope this helps.

Fast(er) method for wildcard searching of 250K+ strings

9 votes

I have an English dictionary in a MySQL database with just over 250K entries, and I'm using a simple ruby front-end to search it using wildcards at the beginning of the strings. So far I've been doing it like this:

SELECT * FROM words WHERE word LIKE '_e__o'

or even

SELECT * FROM words WHERE word LIKE '____s'

I always know the exact length of the word, but all but a single character are potentially unknown.

This is slower than molasses, about fifteen times slower than a similar query without the leading wildcard because the index for the column cannot be used.

I've tried a few methods to narrow the scope of the search. For example, I've added 26 additional columns containing each word's individual letter counts and narrow the search using those first. I've also tried narrowing by word length. These methods made almost no difference, thanks to the inherent inefficiency of leading-wildcard searches. I've experimented with the REGEXP statement, which is even slower.

SQLite and PostgreSQL are just as limited as MySQL, and though I have limited experience with NoSQL systems, my research gives me the impression that they excel at scalability, not performance of the kind I need.

My question then, is where should I look for a solution? Should I continue trying to find a way to optimize my queries or add supplementary columns that can narrow my potential recordset? Are there systems designed specifically to accomplish fast wildcard searching in this vein?

With PostgreSQL 9.1 and the pg_trgm extension you can create indexes that are usable for LIKE conditions such as the ones you describe.

For an example see here: http://www.depesz.com/2011/02/19/waiting-for-9-1-faster-likeilike/

I verified it on a table with 300k rows using LIKE '____1' and it does use such an index. It took about 120ms to count the number of rows in that table (on an old laptop). Interestingly enough, the expression LIKE 'd___1' is not faster; it's about the same speed.

It also depends on the number of characters in the search term: the longer it gets, the slower it will be, as far as I can tell.

You would need to check with your data if the performance is acceptable.
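Because the word length and the known letter positions are always given, another thing to measure against is a plain in-memory index keyed on (length, position, letter). The sketch below is a different technique from the pg_trgm index described above, not a reproduction of it. The names `build_index` and `search` are made up for illustration, and `_` is assumed to be the only wildcard.

```python
from collections import defaultdict


def build_index(words):
    """Map (word length, position, letter) -> set of words with that letter there."""
    index = defaultdict(set)
    by_length = defaultdict(set)
    for word in words:
        by_length[len(word)].add(word)
        for pos, letter in enumerate(word):
            index[(len(word), pos, letter)].add(word)
    return index, by_length


def search(pattern, index, by_length):
    """Return sorted words matching a LIKE-style pattern ('_' = any one letter)."""
    candidates = None
    for pos, letter in enumerate(pattern):
        if letter == "_":
            continue
        bucket = index.get((len(pattern), pos, letter), set())
        # Intersect the buckets of all known letters.
        candidates = bucket if candidates is None else candidates & bucket
    if candidates is None:
        # Pattern was all wildcards: every word of that length matches.
        candidates = by_length.get(len(pattern), set())
    return sorted(candidates)


# Tiny illustrative word list (an assumption, not the real dictionary).
words = ["hello", "cello", "jelly", "pesto", "bells", "words"]
index, by_length = build_index(words)
print(search("_e__o", index, by_length))
print(search("____s", index, by_length))
```

Whether this beats an indexed database query for 250K words would still need to be measured on the real data.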

How expensive are MySQL events?

8 votes

In my web app I use two recurring events that "clean up" one of the tables in the database, both executed every 15 minutes or so.

My question is, could this lead to performance problems in the future? I've read somewhere (I don't recall where exactly) that MySQL events are supposed to be scheduled to run once a month or so. Thing is, these same events keep the table at a pretty small size (as they delete records older than about 15 minutes), so maybe this compensates for the frequency of their execution, right?

Also, is it better to have one big MySQL event or many small ones if they are called at the same frequency?

I don't think the monthly schedule you read about says anything about performance; it is more a suggestion of what events can be used for. So I think you're OK doing your cleanup with events.

After all, the documentation describes events as follows:

Conceptually, this is similar to the idea of the Unix crontab (also known as a “cron job”) or the Windows Task Scheduler.

And the concept for those is that you can run a task every minute if you wish to do so.


On the second part of that question:

Serialize them or spread them out. If you split the work into many events that run at the same time, you will create spikes of possibly very high CPU usage that might slow down the application while the events are processed.

So either pack everything into one event so it runs in succession, or spread the individual events out so they execute at different times during the 15-minute timeframe. Personally I think the first option is preferable: packed into a single event, the tasks are guaranteed to run in succession, even if one of them runs longer than usual.

The same goes for cron jobs. If you schedule 30 long-running exports at the same time, your application is going to fail miserably during that timeslot (learned that the hard way).

Simple SQL Join Understanding?

8 votes

Possible Duplicate:
Explicit vs implicit SQL joins
Is there a difference using join and select from multi-tables?
SQL Joins: Future of the SQL ANSI Standard (where vs join)?

What is the difference between JOIN and declaring multiple tables in the FROM clause?

Such as:

SELECT *
FROM  table1 AS t1
      , table2 AS t2
WHERE t1.id = t2.id

Compared to:

SELECT *
FROM table1 AS t1
INNER JOIN table2 AS t2 ON t2.id = t1.id

The second version, with the explicit JOIN and join condition, is the ANSI SQL-92 join syntax and the generally preferred form.

The implicit join syntax with a WHERE clause is older syntax that is still valid but widely considered bad practice, partly because it is easy to forget the WHERE clause and cause a Cartesian product.
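The point about the forgotten WHERE clause can be checked with a small sketch. It uses Python's built-in sqlite3 module purely for convenience; the table contents are made up, and other databases should behave the same way for these simple queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, a TEXT);
    CREATE TABLE table2 (id INTEGER PRIMARY KEY, b TEXT);
    INSERT INTO table1 VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO table2 VALUES (2, 'p'), (3, 'q'), (4, 'r');
""")

# Implicit join: tables listed in FROM, join condition in WHERE.
implicit = conn.execute(
    "SELECT t1.id, t1.a, t2.b FROM table1 AS t1, table2 AS t2 "
    "WHERE t1.id = t2.id ORDER BY t1.id"
).fetchall()

# Explicit join: the condition lives next to the JOIN it belongs to.
explicit = conn.execute(
    "SELECT t1.id, t1.a, t2.b FROM table1 AS t1 "
    "INNER JOIN table2 AS t2 ON t2.id = t1.id ORDER BY t1.id"
).fetchall()

# Dropping the WHERE clause from the implicit form silently pairs every row
# of table1 with every row of table2.
cartesian = conn.execute(
    "SELECT COUNT(*) FROM table1 AS t1, table2 AS t2"
).fetchone()[0]

print(implicit == explicit)
print(len(explicit), cartesian)
```

Both forms return the same rows here; only the version without the WHERE clause differs, and it does so without raising any error.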

What's the best way to store different images in the database?

5 votes

Don't truly know how to name this question.

But what's the best way (regarding database design) to store images for different purposes?

Let me give an example. I have a bunch of user photos and I have another 5 different sets of photos (like user photos but with no connection to user photos).

Is it best to store all photos in a single database table and try to reference them from within that table? Or is it best to create different tables for each set of photos?

I can see one benefit from creating multiple tables and that's the cascade delete function for removing the photo when the main object is deleted.

Any other aspects to consider?

Another example could be addresses. A user can have an address but so can a company or a location. Create one table for all addresses and try to have some sort of index tables to reference what address belongs to what object or have different tables and eliminate the problem.

Sorry again for the bad naming of the question.

How to store large blobs in sql server

Storing large chunks of binary data in SQL Server is not a great approach. It makes your database very bulky to back up, and performance is generally not great. Storing files is usually done on the file system. SQL Server 2008 has out-of-the-box support for FILESTREAM. Microsoft documents the cases in which to use FILESTREAM as follows:

  • Objects that are being stored are, on average, larger than 1 MB.
  • Fast read access is important.
  • You are developing applications that use a middle tier for application logic.

In your case I think all points are valid.

Enable on Server

To enable FILESTREAM support on the server use the following statement.

EXEC sp_configure filestream_access_level, 2
RECONFIGURE

Configure the Database

To link a FILESTREAM filegroup to your database, create it as follows:

ALTER DATABASE ImageDB ADD FILEGROUP ImageGroup CONTAINS FILESTREAM
ALTER DATABASE ImageDB 
  ADD FILE ( NAME = 'ImageStream', FILENAME = 'C:\Data\Images\ImageStream.ndf')
  TO FILEGROUP ImageGroup

Creating the table

The next step is getting your data in the database with filestream storage:

CREATE TABLE Images
(
    [Id] [uniqueidentifier] ROWGUIDCOL NOT NULL PRIMARY KEY, 
    [CreationDate] DATETIME NOT NULL,
    [ImageFile] VARBINARY(MAX) FILESTREAM NULL
)

For Filestream to work you not only need the FILESTREAM property on a field in the table, but also a field which has the ROWGUIDCOL property.

Inserting Data with TSQL

Now to insert data in this table you can issue a TSQL INSERT, here from C#:

using (var conn = new SqlConnection(connString))
using (var cmd = new SqlCommand("INSERT INTO Images VALUES (@id, @date, CAST(@image AS varbinary(max)))", conn))
{
    // Object initializers keep each array element a SqlParameter.
    cmd.Parameters.AddRange(new[] {
        new SqlParameter("@id", SqlDbType.UniqueIdentifier) { Value = uId },
        new SqlParameter("@date", SqlDbType.DateTime) { Value = creationDate },
        new SqlParameter("@image", SqlDbType.VarBinary) { Value = imageFile }
    });
    conn.Open();
    cmd.ExecuteNonQuery(); // an INSERT returns no result set
}

Inserting data using SqlFileStream

There is also an approach that writes the file data to disk using Win32 directly. This offers you streaming access, since SqlFileStream inherits from System.IO.Stream.

Inserting data using Win32 can be done with, for example, the code below:

    public void InsertImage(string connString, Guid uId, DateTime creationDate, byte[] fileContent)
    {
        using (var conn = new SqlConnection(connString))
        // OUTPUT must precede VALUES, and the FILESTREAM column is ImageFile.
        // An empty (non-NULL) value is inserted so that PathName() returns a path.
        using (var cmd = new SqlCommand(@"INSERT INTO Images OUTPUT INSERTED.ImageFile.PathName() VALUES (@id, @date, CAST(@image AS varbinary(max)))", conn))
        {
            conn.Open();

            using (var transaction = conn.BeginTransaction())
            {
                cmd.Transaction = transaction;
                cmd.Parameters.AddRange(
                    new[] {
                         new SqlParameter("@id", SqlDbType.UniqueIdentifier) { Value = uId },
                         new SqlParameter("@date", SqlDbType.DateTime) { Value = creationDate },
                         new SqlParameter("@image", SqlDbType.VarBinary) { Value = new byte[0] }
                        }
                    );

                var path = (string)cmd.ExecuteScalar();

                cmd.CommandText = "SELECT GET_FILESTREAM_TRANSACTION_CONTEXT()";

                var context = (byte[])cmd.ExecuteScalar();

                // Stream the actual bytes into the file behind the new row.
                using (var stream = new SqlFileStream(path, context, FileAccess.ReadWrite))
                {
                    stream.Write(fileContent, 0, fileContent.Length);
                }

                transaction.Commit();
            }
        }
    }

How to model a Photo storage database

With the filestream approach to store the images the table is very narrow which is good for performance since many records can be stored per 8K data page. I would use the following model:

    -- Referenced tables are created first so that every REFERENCES clause resolves.
    CREATE TABLE Addresses
    (
        AddressId integer not null primary key,
        Type nvarchar(10),
        Street nvarchar(500),
        ZipCode nvarchar(50),
        City nvarchar(500)
    )

    CREATE TABLE Users
    (
        UserId integer not null primary key,
        UserName nvarchar(500),
        AddressId integer not null
            REFERENCES Addresses
    )

    CREATE TABLE Organisations
    (
        OrganisationId integer not null primary key,
        OrganisationName nvarchar(500),
        AddressId integer not null
            REFERENCES Addresses
    )

    CREATE TABLE OrganisationMembers
    (
        OrganisationId integer not null
            REFERENCES Organisations,
        UserId integer not null
            REFERENCES Users,
        PRIMARY KEY (UserId, OrganisationId)
    )
    CREATE NONCLUSTERED INDEX ixOrganisationMembers on OrganisationMembers(OrganisationId)

    CREATE TABLE ImageSets
    (
        ImageSet INTEGER NOT NULL PRIMARY KEY,
        SetName nvarchar(500) NOT NULL,
        Author INTEGER NOT NULL
            REFERENCES Users(UserId)
    )

    CREATE TABLE Images
    (
        Id uniqueidentifier ROWGUIDCOL NOT NULL PRIMARY KEY,
        ImageSet INTEGER NOT NULL
            REFERENCES ImageSets,
        ImageFile VARBINARY(MAX) FILESTREAM NULL
    )

This translates to the following entity relationship diagram:

[Image: entity relationship diagram]

  • Performance-wise, the narrow Images table is very good as it contains only a few bytes of data per record.
  • We can assume that an image is always a member of an image set. The set information could be hidden if there is only 1 image in it.
  • I assume you want to track which users are members of which organisations, so I added a table to link them (assuming a user can be a member of multiple organisations).
  • The primary key on the OrganisationMembers table has UserId as its first field since there are normally a lot more users than organisations, and you will probably want to show which organisations a user is a member of more often than the inverse.
  • The index on OrganisationId in OrganisationMembers is there to cater for queries where the list of members for a specific organisation needs to be shown.

MySQL: why varchar(254) and not varchar(255)?

5 votes

Why do so many developers set varchar to 254 and not 255 when creating MySQL tables?

Proof that this happens: mysql varchar 254

VARCHAR fields require n+1 bytes of storage when the column holds at most 255 bytes, and n+2 bytes when it can hold more than 255.

http://dev.mysql.com/doc/refman/5.0/en/storage-requirements.html

It should be set to 255. I assume developers think they save an extra byte by using 254, but 254 and 255 both need only the one-byte length prefix, and 255 is the conventional choice.
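The arithmetic can be spelled out in a few lines. This is only a sketch of the rule quoted above, assuming a single-byte character set so that characters and bytes coincide; `varchar_storage_bytes` is a made-up helper, not a MySQL function.

```python
def varchar_storage_bytes(value_bytes, max_column_bytes):
    """Bytes used by one VARCHAR value: the data plus a 1- or 2-byte length prefix."""
    if value_bytes > max_column_bytes:
        raise ValueError("value does not fit in the column")
    # The prefix size depends on the column's maximum, not on the stored value.
    prefix = 1 if max_column_bytes <= 255 else 2
    return value_bytes + prefix


# A 10-byte string costs the same in VARCHAR(254) and VARCHAR(255) ...
print(varchar_storage_bytes(10, 254), varchar_storage_bytes(10, 255))
# ... and one byte more once the column can hold more than 255 bytes.
print(varchar_storage_bytes(10, 256))
```

So the step that costs an extra byte is the one from 255 to 256, not the one from 254 to 255.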

Database build process management

5 votes

What options exist for managing database scripts and doing new development on a database?

For example, suppose the database is used by a number of applications and a number of developers work with it. What is the best way to keep the database up to date with the latest changes, and what should the process be for deploying changes to production?

I see two options:

  1. Microsoft Visual Studio has a database project type, so all database scripts can be added to the project and the database rebuilt from Visual Studio
  2. Restore the database from a backup and apply only the new scripts to it

What other options exist? How can I manage database development, and what are the best practices? What are the advantages and disadvantages of the options I listed above? How should new SQL scripts be maintained?

I understand that a source control system should be used, but with DB scripts it's not as easy as with application code.

I believe there is no universal solution, but I am at least interested in DB developers' opinions on how it's implemented in your company.

Liquibase is IMHO the best tool. It's brutally simple in its approach, which is one of the reasons it works so well.

You can read up on the site how it works, but basically it creates and manages a simple table that stores a hash of each script to determine whether it has run a script or not. There's pre- and post-SQL too, and you can bypass scripts based on conditions... it does pretty much everything you'd want or need. It also has Maven integration, so it can seamlessly become part of your build.
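The tracking-table idea can be sketched in a few lines. This illustrates the general approach only; it is not Liquibase's implementation or table layout. The table name `applied_scripts`, the function `apply_pending` and the example scripts are all made up, and SQLite is used just so the sketch is self-contained.

```python
import hashlib
import sqlite3


def apply_pending(conn, scripts):
    """Run each (name, sql) script once, recording a hash of its contents.

    Returns the names of the scripts executed during this call.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS applied_scripts ("
        "name TEXT PRIMARY KEY, checksum TEXT NOT NULL)"
    )
    executed = []
    for name, sql in scripts:
        checksum = hashlib.sha256(sql.encode("utf-8")).hexdigest()
        row = conn.execute(
            "SELECT checksum FROM applied_scripts WHERE name = ?", (name,)
        ).fetchone()
        if row is not None:
            if row[0] != checksum:
                # An already-applied script was edited afterwards.
                raise RuntimeError(f"{name} changed after it was applied")
            continue  # already applied, skip it
        conn.executescript(sql)
        conn.execute(
            "INSERT INTO applied_scripts (name, checksum) VALUES (?, ?)",
            (name, checksum),
        )
        conn.commit()
        executed.append(name)
    return executed


conn = sqlite3.connect(":memory:")
# Hypothetical change scripts, in the order they should be applied.
scripts = [
    ("001_create_users", "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);"),
    ("002_add_email", "ALTER TABLE users ADD COLUMN email TEXT;"),
]
first_run = apply_pending(conn, scripts)
second_run = apply_pending(conn, scripts)
print(first_run, second_run)
```

Running the same list twice is harmless because the second pass finds every script already recorded, which is what makes this style of tool safe to wire into a build.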

I used it very successfully on a large (8 developers) project and now I wouldn't use anything else.

And it's free!

inserting / updating a row based on its availability in database

5 votes

This is a common scenario, but I wanted to find out which approach performs best and is considered best practice.

I have a table with 4 columns: id, name, and two other fields. Id is the PK and name is a unique key. I'm reading data from an Excel file, populating a domain object with the values of each row and then saving it. When saving, I want to see whether a record already exists for the same name and, if it exists, update it. Otherwise I want to save it as a new record.

I can do it with a normal select query for the name, check for null, and insert or update based on that, but I have thousands of rows to read from Excel files, and performance is a stated non-functional requirement.

So please advise me on the best way to handle this scenario. I haven't started coding my persistence layer yet, so I can switch to an ORM or plain JDBC according to your suggestion.

Edited: If I use name as the primary key, then I think I can use saveOrUpdate or merge from an ORM to fulfill my need. Is that a good idea? Thanks & regards, Prasath.

I think the fastest way would be to carry out all the insert/updates in the database itself rather than connecting to it and using a large number of statements.

Note, this is Oracle specific, but other databases may have similar concepts.

I would use the following approach: First save the Excel data as a CSV file on the database server (/mydatadir/mydata.csv), then in Oracle I would be using an external table:

create or replace directory data_dir as '/mydatadir/';
create table external_table (
  id number(18),
  name varchar2(30),
  otherfield1 varchar2(40),
  otherfield2 varchar2(40))
organization external (
  type oracle_loader
  default directory data_dir
  access parameters
  ( fields terminated by ',' )
  location ('mydata.csv')
)

(Note, the external table wouldn't have to be set up every time)

Then you can use the following command to merge the data into your table:

merge into yourtable t
using external_table e
on (t.name = e.name)
when matched then
   update set t.id = e.id,
              t.otherfield1 = e.otherfield1,
              t.otherfield2 = e.otherfield2
when not matched then
   insert (t.id, t.name, t.otherfield1, t.otherfield2)
   values (e.id, e.name, e.otherfield1, e.otherfield2)

This will upsert the rows in yourtable in one Oracle command, so all the work will be carried out by the database.

EDIT:

This merge command can be issued over plain JDBC (though I prefer using Spring's SimpleJdbcTemplate)

EDIT2:

In MySQL you can use the following construct to perform the merge:

insert into yourtable (id, name, otherfield1, otherfield2)
values (?, ?, ?, ?), 
       (?, ?, ?, ?), 
       (?, ?, ?, ?) -- repeat for each row in the Excel sheet...
on duplicate key update
    otherfield1 = values(otherfield1),
    otherfield2 = values(otherfield2)

This can be issued as a plain JDBC statement and is going to be better than a separate update and insert, and you can call these in batches of (say) a hundred rows from the spreadsheet. This would mean 1 JDBC call for every 100 rows in your Excel sheet and should perform well. That'll allow you to do it without external tables (you'd need a UNIQUE index on the name column for this to work, I wouldn't change the primary key as this could cause you problems with foreign keys if you needed to change somebody's name).
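The batching idea can be tried out with a small sketch. It uses Python's sqlite3 module and SQLite's `ON CONFLICT ... DO UPDATE` clause (available from SQLite 3.24) as a stand-in for MySQL's `ON DUPLICATE KEY UPDATE`, so the SQL differs slightly from the statement above. The helper `upsert_in_batches` and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourtable ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE, "
    "otherfield1 TEXT, otherfield2 TEXT)"
)
conn.execute("INSERT INTO yourtable VALUES (1, 'alice', 'old', 'old')")

# SQLite's spelling of "insert, or update the existing row on a name clash".
UPSERT = (
    "INSERT INTO yourtable (name, otherfield1, otherfield2) VALUES (?, ?, ?) "
    "ON CONFLICT(name) DO UPDATE SET "
    "otherfield1 = excluded.otherfield1, otherfield2 = excluded.otherfield2"
)


def upsert_in_batches(conn, rows, batch_size=100):
    """Send rows to the database in batches, one transaction per batch."""
    for start in range(0, len(rows), batch_size):
        with conn:  # commits the batch, or rolls it back on error
            conn.executemany(UPSERT, rows[start:start + batch_size])


# Rows as they might come out of the spreadsheet (hypothetical data).
rows = [("alice", "new1", "new2"), ("bob", "b1", "b2")]
upsert_in_batches(conn, rows)
result = conn.execute(
    "SELECT id, name, otherfield1, otherfield2 FROM yourtable ORDER BY name"
).fetchall()
print(result)
```

The existing row keeps its id and only has its other fields overwritten, which is why the UNIQUE index on name matters and the primary key can stay as it is.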

MySQL also has comparable facilities for reading a CSV file directly (the CSV storage engine, or LOAD DATA INFILE for bulk loading), which I think would be faster still than inserting the data in batches as described above. As long as the CSV file is uploaded to the correct location, the import should work quickly.

Android when should I be opening and closing db connections?

5 votes

Overview: Opening db connections in the onStart and in async tasks has gotten to be quite complex. Is it bad practice to have globally available db connections? If it is bad what is a better way?

Details: I have an application that connects to the sqlite3 database in several activities. At first there weren't many places where I needed to access the database so I was just opening and closing each time I needed access. Then there were more places that needed to access it so, as suggested on another Stack Overflow question, I started opening the database connections in the onStart method of the activity that needed a connection and closing it in the onStop method.

This worked fine until I started needing connections in some asynchronous tasks that outlived the activity. Since the onStop method for the activity had been called and connections had been closed, by the time the async task tried accessing the database it was failing. As a solution I created separate connections for each async task that were opened in the onPreExecute method and closed in the onPostExecute method.

This has resulted in a lot of opening and closing of connections, and I'm wondering if creating globally available db connections in the application context is a good idea. It would definitely clean up a lot of code and remove any unclosed db exceptions that are happening if I forget to close a connection or the app experiences a force close. Any one else tried this / see any problems with this approach?

I ran into similar problems some weeks ago. I use several classes that have persistence in a SQLite db.

As I wanted to decouple the activities from persistence, I created static inner classes (called Managers) for those that need persistence. Each time I instantiate a Manager a db connection is created, and after using the manager I explicitly close it. If I had global managers to access the tables, they would need to be synchronized because of concurrency, and I would not know when to close those db connections. So I did it this way.

Hope this helps!

Cannot connect to mysql with 127.0.0.1

5 votes

With the following code I can connect to mysql: mysql_connect("localhost","username","");

But if I change localhost to 127.0.0.1 I get the following error:

Can't connect to MySQL server on '127.0.0.1' (13)

Why doesn't it work with 127.0.0.1?

localhost is special cased and uses UNIX sockets instead of TCP/IP. 127.0.0.1 doesn't get that special handling.

See the documentation:

On Unix, MySQL programs treat the host name localhost specially, in a way that is likely different from what you expect compared to other network-based programs. For connections to localhost, MySQL programs attempt to connect to the local server by using a Unix socket file. This occurs even if a --port or -P option is given to specify a port number. To ensure that the client makes a TCP/IP connection to the local server, use --host or -h to specify a host name value of 127.0.0.1, or the IP address or name of the local server. You can also specify the connection protocol explicitly, even for localhost, by using the --protocol=TCP option.

If it doesn't work when you use TCP/IP then the database probably isn't listening on the network. This is generally a good thing as it enhances security (not that listening on 127.0.0.1 exposes any problems, but listening on all interfaces gives more opportunity for attacks).

If you really want to allow connections via the network, then see skip-networking.
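A quick way to see whether anything is listening on TCP at all, independent of the MySQL client, is to try opening a plain socket. This is only a rough check under the assumption that the server would use MySQL's default port 3306; `accepts_tcp` is a made-up helper name.

```python
import socket


def accepts_tcp(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or otherwise unreachable.
        return False


# 3306 is MySQL's default port; adjust it if your server uses another one.
print(accepts_tcp("127.0.0.1", 3306))
```

A False result is consistent with the server accepting connections only through the Unix socket file, though a firewall or other local policy could produce the same outcome.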

Is it harmful to store javascript and css in a database?

5 votes

Ok so I'm trying to store user custom css and javascript in mysql database to be used later the head section on the page, so is it a good idea to store css and javascript in a database ? and if not what is the safe way to do this ?

I'm using wordpress and using esc_js() on javascript code I noticed it adds a backslash before quotes, and adds \n instead of new lines, so is it enough for javascript ?

and thanks in advance.

Short answer: Yes, it's fine as long as it's people you trust.

Long answer:

Make sure you 100% trust your users. There are a lot of exploits that can be done through CSS/JS, so make sure you're only allowing people you completely trust to do this. Even if you tried to implement filtering of some kind, people would find ways around it.

Depending on how easily you could figure out the file name, I might consider storing the data in a file. After all, the database is unnecessary overhead if you can readily determine the file name. If you can't determine the file name easily, I would just go ahead and store it in the database. (Or if you're planning on storing any metadata or version information or anything, I would go ahead and go for a DB.)