Best php questions in September 2011

Match a^n b^n c^n (e.g. "aaabbbccc") using regular expressions (PCRE)

30 votes

It is a well known fact that modern regular expression implementations (most notably PCRE) have little in common with the original notion of regular grammars. For example you can parse the classical example of a context-free grammar {anbn; n>0} (e.g. aaabbb) using this regex (demo):

~^(a(?1)?b)$~

My question is: How far can you go? Is it also possible to parse the context-sensitive grammar {anbncn;n>0} (e.g. aaabbbccc) using PCRE?

Inspired by NullUserExceptions answer (which he already deleted as it failed for one case) I think I have found a solution myself:

$regex = '~^
    (?=(a(?-1)?b)c)
     a+(b(?-1)?c)
$~x';

var_dump(preg_match($regex, 'aabbcc'));    // 1
var_dump(preg_match($regex, 'aaabbbccc')); // 1
var_dump(preg_match($regex, 'aaabbbcc'));  // 0
var_dump(preg_match($regex, 'aaaccc'));    // 0
var_dump(preg_match($regex, 'aabcc'));     // 0
var_dump(preg_match($regex, 'abbcc'));     // 0

Try it yourself: http://codepad.viper-7.com/1erq9v


Explanation

If you consider the regex without the positive lookahead assertion (the (?=...) part), you have this:

~^a+(b(?-1)?c)$~

This does nothing more than check that there's an arbitrary number of as, followed by an equal number of bs and cs.

This doesn't yet satisfy our grammar, because the number of as must be the same, too. We can ensure that by checking that the number of as equals the number of bs. And this is what the expression in the lookahead assertion does: (a(?-1)?b)c. The c is necessary so we don't only match a part of the bs.


Conclusion

I think this impressively shows that modern regex is not only capable of parsing non-regular grammars, but can even parse non-context-free grammars. Hopefully this will lay to rest the endless parroting of "you can't do X with regex because X isn't regular"

Javascript closures vs PHP closures, what's the difference?

21 votes

What are the differences between closures in JS and closures in PHP? Do they pretty much work the same way? Are there any caveats to be aware of when writing closures in PHP?

One difference is how both cope with storing the context in which an anonymous function is executed:

// JavaScript:
var a = 1;
var f = function() {
   console.log(a);
};
a = 2;
f();
// will echo 2;

// PHP
$a = 1;
$f = function() {
    echo $a;
};
$a = 2;
$f();
// will result in a "PHP Notice:  Undefined variable: a in Untitled.php on line 5"

To fix this notice you'll have to use the use syntax:

$a = 1;
$f = function() use ($a) {
    echo $a;
};
$a = 2;
$f();
// but this will echo 1 instead of 2 (like JavaScript)

To have the anonymous function behave somehow like the JavaScript counterpart you'll have to use references:

$a = 1;
$f = function() use (&$a) {
    echo $a;
};
$a = 2;
$f();
// will echo 2

I think this is the most striking difference between JavaScript and PHP closures.

Second difference is that every JavaScript closure has a this context available which means, that you can use this inside the closure itself (although it's often quite complicated to figure out what this actually refers to) - PHP's current stable version (PHP 5.3) does not yet support $this inside a closure, but PHP's upcoming version (PHP 5.4) will support $this binding and rebinding using $closure->bind($this) (See the Object Extension RFC for more info.)

Third difference is how both languages treat closures assigned to object properties:

// JavaScript
var a = {
    b: function() {}
};
a.b(); // works


// PHP
$a = new stdClass();
$a->b = function() {};
$a->b(); // does not work "PHP Fatal error:  Call to undefined method stdClass::b() in Untitled.php on line 4"

$f = $a->b;
$f(); // works though

The same is true if closures are assigned to properties in class definitions:

class A {
    public $b;

    public function __construct() {
        $this->b = function() {};
    }

    public function c() {
        $this->b();
    }
}
$a = new A();
// neither
$a->b();
// nor
$a->c();
// do work

Fourth difference: JavaScript Closures are full fledged objects, wheres in PHP they are restricted objects. For instance, PHP Closures cannot have properties of their own:

$fn = function() {};
$fn->foo = 1;
// -> Catchable fatal error: Closure object cannot have properties

while in JavaScript you can do:

var fn = function() {};
fn.foo = 1;
fn.foo; // 1

Fifth difference: Returned closures can be immediately called upon in Javascript:

var fn = function() { return function() { alert('Hi');}}
fn()();    

Not in PHP:

$fn = function() { return function() { echo('Hi');};};
$fn()();     // syntax error

Find first character that is different between two strings

19 votes

Given two equal-length strings, is there an elegant way to get the offset of the first different character?

The obvious solution would be:

for ($offset = 0; $offset < $length; ++$offset) {
    if ($str1[$offset] !== $str2[$offset]) {
        return $offset;
    }
}

But that doesn't look quite right, for such a simple task.

You can use a nice property of bitwise XOR (^) to achieve this: Basically, when you xor two strings together, the characters that are the same will become null bytes ("\0"). So if we xor the two strings, we just need to find the position of the first non-null byte using strspn:

$position = strspn($string1 ^ $string2, "\0");

That's all there is to it. So let's look at an example (on codepad):

$string1 = 'foobarbaz';
$string2 = 'foobarbiz';
$pos = strspn($string1 ^ $string2, "\0");

printf(
    'First difference at position %d: "%s" vs "%s"',
    $pos, $string1[$pos], $string2[$pos]
);

That will output:

First difference at position 7: "a" vs "i"

So that should do it. It's very efficient since it's only using C functions, and requires only a single copy of memory of the string.

building a 'two-way' OO dynamic ACL system

15 votes

This question came up while designing a dedicated ACL system for a custom application, but I think it applies to ACL systems in general, as I haven't found out how to tackle this problem by looking at some of the mainstream systems, like Zend_ACL.

In my application, the permissions are granted dynamically, for example: a user gets view permissions on an activity because he is a member of the team the activity is linked to. This builds on the assumption that you always have an Employee (user) that wants to perform an action (view/edit/etc) on an Item (one of the objects in my application, eg Activity, Team, etc). This is sufficient for my targeted use;

$Activity = new Activity( $_POST['activity_id'] );

$Acl = new Acl( $Activity );
if ( !$Acl->check( 'edit' ) {
    throw new AclException('no permission to edit');
}

My Acl class contains all the business rules to grant the permissions, and they're created 'on the fly' (although sometimes cached for performance reasons);

/**
 * Check the permissions on a given activity.
 * @param Activity $Activity
 * @param int $permission (optional) check for a specific permission
 * @return mixed integer containing all the permissions, or a bool when $permission is set
 */
public function checkActivity( Activity $Activity, $permission = null ) {
    $permissions = 0;

    if ( $Activity->owner_actor_id == $this->Employee->employee_id ) {
        $permissions |= $this->activity['view'];
        $permissions |= $this->activity['remove'];
        $permissions |= $this->activity['edit'];
    } elseif ( in_array( $this->Employee->employee_id, $Activity->contributor_ids_arr ) ) {
        $permissions |= $this->activity['view'];
    } else {
        /**
         * Logged in user is not the owner of the activity, he can contribute 
         * if he's in the team the activity is linked to
         */
        if ( $Activity->getTeam()->isMember( $this->Employee ) ) {
            $permissions |= $this->activity['view'];
        }
    }

    return ( $permission ? ( ( $permission & $permissions ) === $permission ) : $permissions );
}

This system works fine as-is.

The problem with this approach arises when you want to 'reverse' the ACL rules. For instance, "fetch all activities that I'm allowed to edit". I don't want to put any logic like WHERE owner_actor_id = $Employee->employee_id in the code that needs the activities, because this is the responsibility of the Acl class and it should be kept centralized. With the current implementation, I have no other option that to fetch all activities in the code, and then assert them one by one. This is of course a very inefficient approach.

So what I'm looking for is some ideas on a good architecture (or a pointer to an existing ACL implementation or some relevant design patterns) to create an ACL system that can somehow do both hasPermission( $Item, $permission ) and fetchAllItems( $permission ), ideally with the same set of business rules.

Thank you all in advance!


I've looked at the Zend_ACL implementation, but that focuses more on general permissions. I also found the following questions here on SO:

But unfortunately they don't seem to answer the question either.

A colleague offered me another view on the matter, that might also be the solution to this problem.

What I thought I wanted is put all access-related code in the ACL class (mirroring my statement that "I don't want to put any logic like WHERE owner_actor_id = $Employee->employee_id in the code that needs the activities, because this is the responsibility of the Acl class and it should be kept centralized.").

What I really want is to make sure the user can never access something that doesn't comply to the rules listed in the ACL class. It's not really a problem if the 'worker code' already fetches a subset of the data -- as long as it's ultimately checked against the 'real' ACL. The worst that can happen is that the user sees less than he's supposed to, which is way better than more.

With this 'solution' (alternative approach if you will), you avoid fetching all the data, while maintaining the benefit of having all the rules in one place. Any other solution that I could think of would involve a duplication of the rules, since you need rules in PHP for checking a given resource, and rules written in MySQL for fetching all.

It's still possible, by the way, to put the subset fetch code in the Acl class -- however I think it would be better to keep the class small and focused (because I think the readability of the code in that class is also very important).

Is en_UK an illegal locale?

14 votes

So far I had always used 'en_UK' for British English. Today I got an error when using it with Zend Framework because the locale is not included in the long list of recognized locales.

Here's just a short extract of that list:

'ee_GH' => true, 'ee_TG' => true, 'ee'    => true, 'el_CY' => true, 'el_GR' => true,
'el'    => true, 'en_AS' => true, 'en_AU' => true, 'en_BE' => true, 'en_BW' => true,
'en_BZ' => true, 'en_CA' => true, 'en_GB' => true, 'en_GU' => true, 'en_HK' => true,
'en_IE' => true, 'en_IN' => true, 'en_JM' => true, 'en_MH' => true, 'en_MP' => true,
'en_MT' => true, 'en_NA' => true, 'en_NZ' => true, 'en_PH' => true, 'en_PK' => true,
'en_SG' => true, 'en_TT' => true, 'en_UM' => true, 'en_US' => true, 'en_VI' => true,
'en_ZA' => true, 'en_ZW' => true, 'en'    => true, 'eo'    => true, 'es_AR' => true,
'es_BO' => true, 'es_CL' => true, 'es_CO' => true, 'es_CR' => true, 'es_DO' => true,

As you can see, there are all kinds of territories with English language and there is also an entry 'en_GB' which I assume stands for Great Britain... but there is no 'en_UK'. Is that just a 'bug' in Zend Framework or is there another reason for that?

Locales use ISO 3166-1 for country codes. The wikipedia writeup includes:

The codes are chosen, according to the ISO 3166/MA, "to reflect the significant, unique component of the country name in order to allow a visual association between country name and country code".[7] For this reason, common components of country names like "Republic", "Kingdom", "United", "Federal" or "Democratic" are normally not used for deriving the code elements. As a consequence, for example, the United Kingdom is officially assigned the alpha-2 code GB rather than UK, based on its official name "United Kingdom of Great Britain and Northern Ireland" (although UK is reserved on the request of the United Kingdom).

JSON and PHP arrays

14 votes
json_encode(array(
    array(0 => "431.940054495913"),
    array(1 => "431.940054495913"),
));

Is render like this:

[
    ["431.940054495913"],
    {"1":"431.940054495913"}
]

Why are the two arrays rendered differently ?

How to render the first array like the second ?

Any PHP array that can be rendered as a JSON array will be rendered as a JSON array: Any PHP array having only sequential numeric keys starting from 0 will be rendered as a JSON array.

This is the case for the first array: array(0 => "431.940054495913").

How to fix this

  • The JSON_FORCE_OBJECT flag will render all PHP arrays as JSON objects

    json_encode(array(0 => "431.940054495913"), JSON_FORCE_OBJECT);
    // {"0": "431.940054495913"}
    
    json_encode(array(0 => "431.940054495913"));
    // ["431.940054495913"]
    
  • Alternatively, you could convert your PHP array to a PHP object:

    json_encode( (object) array(0 => "431.940054495913"));
    // {"0": "431.940054495913"}
    

    (if you don't want to render every array as object or if you don't have JSON_FORCE_OBJECT)

Why does an infinitely recursive function in PHP cause a segfault?

13 votes

A hypothetical question for you all to chew on...

I recently answered another question on SO where a PHP script was segfaulting, and it reminded me of something I have always wondered, so let's see if anyone can shed any light on it.

Consider the following:

<?php

  function segfault ($i = 1) {
    echo "$i\n";
    segfault($i + 1);
  }

  segfault();

?>

Obviously, this (useless) function loops infinitely. And eventually, will run out of memory because each call to the function executes before the previous one has finished. Sort of like a fork bomb without the forking.

But... eventually, on POSIX platforms, the script will die with SIGSEGV (it also dies on Windows, but more gracefully - so far as my extremely limited low-level debugging skills can tell). The number of loops varies depending on the system configuration (memory allocated to PHP, 32bit/64bit, etc etc) and the OS but my real question is - why does it happen with a segfault?

  • Is this simply how PHP handles "out-of-memory" errors? Surely there must be a more graceful way of handling this?
  • Is this a bug in the Zend engine?
  • Is there any way this can be controlled or handled more gracefully from within a PHP script?
  • Is there any setting that generally controls that maximum number of recursive calls that can be made in a function?

I could be totally wrong about this since my testing was fairly brief. It seems that Php will only seg fault if it runs out of memory (and presumably tries to access an invalid address). If the memory limit is set and low enough, you will get an out of memory error beforehand. Otherwise, the code seg faults and is handled by the OS.

Can't say whether this is a bug or not, but the script should probably not be allowed to get out of control like this.

See the script below. Behavior is practically identical regardless of options. Without a memory limit, it also slows my computer down severely before it's killed.

<?php
$opts = getopt('ilrv');
$type = null;
//iterative
if (isset($opts['i'])) {
   $type = 'i';
}
//recursive
else if (isset($opts['r'])) {
   $type = 'r';
}
if (isset($opts['i']) && isset($opts['r'])) {
}

if (isset($opts['l'])) {
   ini_set('memory_limit', '64M');
}

define('VERBOSE', isset($opts['v']));

function print_memory_usage() {
   if (VERBOSE) {
      echo memory_get_usage() . "\n";
   }
}

switch ($type) {
   case 'r':
      function segf() {
         print_memory_usage();
         segf();
      }
      segf();
   break;
   case 'i':
      $a = array();
      for ($x = 0; $x >= 0; $x++) {
         print_memory_usage();
         $a[] = $x;
      }
   break;
   default:
      die("Usage: " . __FILE__ . " <-i-or--r> [-l]\n");
   break;
}
?>

Escaping user input from database necessary?

12 votes

So I know about MySQL injection and always escape all my user input before putting it in my database. However I was wondering, imagine a user tries to submit a query to inject, and I escape it. What if I then at a later moment take this value from the database, and use it in a query. Do I have to escape it again?

So: (sql::escape() contains my escape function)

$userinput = "'); DROP `table` --";
mysql_query("INSERT INTO `table` 
             (`foo`,`bar`) 
             VALUES 
             ('foobar','".sql::escape($userinput)."')");

// insert php/mysql to fetch `table`.`bar` into $output here

mysql_query("INSERT INTO `table2` 
            (`foo`,`bar`) 
            VALUES
            ('foobar','".$output."')");

Does MySQL automatically escape their output or something like that, or should I escape in the second query as well?

This is a testcase but this occurs in some other ways within my program and I'm wondering how tight the security has to be for cases like this.

EDIT

My escape function

static function escape($string){

    if(get_magic_quotes_gpc()) 
        $string = stripslashes($string); 

    return mysql_real_escape_string($string);

}

Does MySQL automatically escape their output or something like that, or should I escape in the second query as well?

You need to escape in the second query as well. MySQL does not do any escaping on its output.

Long answer: MySQL string escaping does not modify the string that is being inserted, it just makes sure it doesn't do any harm in the current query. Any SQL injection attempt still remains in the data.

Proper Way To Use Git/GitHub - PHP System with Dev/Testing/Production servers

11 votes

I apologize if this is obvious or easy, I have looked at a good number of git/github tutorials and read other articles, but I want to make sure what I'm doing is right.

I want to incorporate VC (for obvious reasons) into my development team and process.

Current development process (using Dreamweaver):
* Receive a ticket (or work order)
* Download file on Development server
* Make changes to the file
* Upload file back to development server
* Changes tested/verified
* Send to production server


I'm trying to figure out how to make our new development process with using Git.

I am switching over to PHPStorm (which is an actual PHP IDE with direct integration with Git).

Would it be something like

  • Receive a ticket (or work order)
  • Checkout/Update/Download file(s)
  • Change Files
  • Upload file (which I assume is also the current working directory...?)
  • At the end of the day, do a commit
  • Have build script send data to testing server (nightly build)

Or would it be better to do something like

  • Receive a ticket (or work order)
  • Checkout/Update/Download file(s)
  • Change Files
  • Upload file/commit
  • Have build script send data to testing server (nightly build)

Or is there another way? Having a bit of trouble understanding what would be the optimal flow?

Any help would be greatly appreciated.


Edit

I'm see if it is best to have a version of the server locally (every developer), and if so, how does that work if you have 7 or so branches?

If not, how do you deal with 7 or so branches with them on the web? Do you FTP files up or use Git Hooks to make them auto update?

Assuming you have a live server and a development server I would do something along these lines.

Before even starting with a development cycle I would at least have two branches:

  1. Master - the development server runs on this branch
  2. Stable - the live server runs on this branch.

So if a developer gets a ticket or a work order he/she will perform the following actions:

  1. git pull origin master
  2. git branch featureBranch (named as the ticket id or as a good description for the work order)
  3. git checkout featureBranch
  4. Make changes which will accomplish the desired change. Commit as often as is necessary. Do this because you will create valuable history. For instance you can try an approach to a problem and if it doesn't work, abandon it. If a day later you see the light and want to re-apply the solution, it is in your history!
  5. When the feature is fully developed and tested locally, checkout master.
  6. git merge featureBranch
  7. git push origin master
  8. Test the pushed changes on your development server. This is the moment to run every test you can think of.
  9. If all is working out, merge the feature or fix into the stable branch. Now the change is live for your customers.

Getting the code on the server

The updating of servers shouldn't be a problem. Basically I would set them up as users just like you're developers are. At my company we've setup the servers as read-only users. Basically that means the servers can never push anything but can always pull. Setting this up isn't trivial though, so you could just as well build a simple webinterface which simply only allows a git pull. If you can keep your developers from doing stuff on live implementations you're safe :)

[EDIT]

In response to the last questions asked in the comments of this reaction:

I don't know if I understand your question correctly, but basically (simplified a bit) this is how I would do this, were I in you shoes. Example setup

The testing machine (or the webroot which acts as testing implementation) has it source code based in a git repository with the master branch checked out. While creating this repository you could even remove all other references to all other branches so you'll be sure no can checkout a wrong branch in this repository. So basically the testing machine has a Git repository with only a master branch which is checked out.

For the live servers I would do exactly the same, but this time with the stable branch checked out. Developer should have a local repository cloned in which all branches exist. And a local implementation of the software you guys build. This software gets its source from a the local git repository. In other words: from the currently checked out branch in this repository.

Actual coding

When a new feature is wanted, a local feature branch can be made based on the current master. When the branch is checked out the changes can be made and checked locally by the developer (since the software is now running on the source of the feature branch).

If everything seems to be in order, the changes get merged from feature branch to master and pushed to your "git machine". "your github" so to speak. Testing can now pull the changes in so every test necessary can be done by QA. If they decide everything is ok, the developer can merge the changes from master to stable and push again.

All thats left now is pulling form your live machines.

PHP Interfaces: How are they usable in practice?

10 votes

I'll start by saying that I know how PHP interfaces work and how to "use" them.

My question is rather; how do they become useful in real life applications?

I've been writing PHP for over 3 years now and have never felt the need for an interface. I am writing interfaces more for good practice than for a particular purpose.

I'll provide an example where I've used interfaces in my own real-world experience.

Interfaces are extremely useful when you need to define something like a plugin architecture. Suppose your application accepts authentication plugins, allowing your end user implementers to integrate with their own internal auth infrastructures (LDAP, Shibboleth, some custom auth database, whatever). In order for a plugin to be compatible, it must implement the following methods:

validate_user()
start_user_session()
logout_user()
get_user_details()

If defined in an interface, a class then implements the interface ensuring that the necessary methods are present for compatibility with the plugin architecture.

Unfortunately, PHP does not enforce the return type of interface methods, so you must simply take care to document the expected return functionality of your methods.

I'm about to use a goto statement

10 votes

Save me from raptor death - is there any better way to handle this kind of structure?

while(condition) {
    $this->phase1();
    $this->phase2();
    $this->phase3();
    $this->phase4();
}

Throughout either one of those methods, the condition could be met. IMMEDIATELY after the condition is met, the loop MUST exit. If I could call break; inside of phase2(); for example, I wouldn't need a goto statement (but of course, that would throw an error).

Return a boolean to execute each stage until successful.

while (condition) {
    if ($this->phase1() || $this->phase2() || $this->phase3() || $this->phase4()) {
        // Success!
    }
}

php static in if statement

9 votes

I have a construction like this in my config file:

<?php
if (true) {
    $nonstatic = 1;
    static $config = 1;
}
else {
    $nonstatic = 2;
    static $config = 2;
}

echo $nonstatic;
echo $config;
?>

So why the $config contains 2 if this part of the statement is false and $nonstatic contains 1? Is it a bug?

I suppose this chunk is being included from a function.

Initialisations of static variables are resolved at compile-time, and if the interpreter finds multiple initialisations, it simply takes the bottom one.

How to send the browser to an error page if part of the response has been sent (chunked)

9 votes

This is typical scenario: a page is evaluated, and there's a buffer - once the buffer is full, the part of the page that is evaluated is sent to the browser. This uses the HTTP 1.1 chunked encoding.

However, an error can occur in one of the chunks (after the first one is already sent). In that case:

  • you can't redirect (send a Location header), because the headers and the response status were already sent
  • you can't do server-side redirect (forward), because the new page will have to be rendered after the part that is already sent - it will look ugly for sure.

So what should you do in this case? I asked a question whether you can send a Location header in the chunked trailer, but this is low-level http and the abstraction of languages may not allow it, even if it is possible (and it is likely not to be supported across browsers)

Another option is to send a <script>window.href.location="errorPage"</script> and thus force the client to redirect, but that's ugly. Plus you have to put </script> to close any potential unclosed <script> tag in the partial page.

(I'm tagging major web languages/frameworks, because this is an universal issue)

You cannot redirect from the server in a chunked encoding because the headers have already been sent. The only way to perform a redirect is to send a <script> tag from the server and perform a client side redirect. Just out of curiosity are you trying to implement a COMET server? If this is the case HTML5 WebSockets seem better way (if the browsers you are targeting support them of course) compared to the hidden iframe technique.

How do I block php debug output from being committed in svn?

8 votes

I would like to block debug functions var_dump, print_r, etc... from being commited to the repo so that QA can go over things and not report bugs like "There is a huge block of text on all of the pages!!"

I have tried regex (not a great idea... presumably).

I have also tried token_get_all but for some reason, it returns T_STRING for each of the debug functions, which I guess would work, but it seems odd...

Is there a third better way?

Based on my new understanding, this is what I have:

$debug_functions = array('print_r', 'var_dump', 'var_export');

foreach($files as $file=>$ext){
    $file_contents = file_get_contents($file);
    //break the content into tokens
    $tokens = token_get_all($file_contents);
    foreach($tokens as $t){
        //if the token id is an int (sometimes it isn't)
        if(is_int($t[0])){
            //if it matches our debug stuff...
            if($t[0] == T_STRING && (in_array($t[1], $debug_functions) || preg_match('/xdebug_.*?/', $t[1]))){
                echo 'Debug output '. $t[1] . ' found on line '. $t[2] . PHP_EOL;
            }
        }
    }
}

What ways are there to store information about an anonymous/guest user in a database?

7 votes

Our application has an online shop among other features, and users are normally requested to register before completing a sale, creating a unique customer_ID in the process. When they return, they can log in and their contact details and transaction history are retrieved from the database.

We are now exploring what to do in the case of an 'anonymous' or 'guest' customer, opening up the online shop to customers who don't want to register, and also for sales logged in the backend application, where taking the customer's email, postal address, etc is just too time consuming. The solution has applications outside the online shop too.

Multiple companies use the same database, and the database is built on a party model structure, so we have explored a few options:

  1. Store all anonymous customers under one pre-defined customer_ID in the transaction table:
    1. customer_ID = 0 for every anonymous user, and customer_ID > 0 for every real user
      • This is straight-forward to hard-code into the application
      • But more involved to determine which customers belong to which company
      • Should details for customer_ID = 0 exist in the customer table in the database or as an object in the application?
        • If in the database, what database-level constraints can be made to ensure that it always exists?
        • If not in the database, then foreign key constraints from transaction.customer_ID to customer.customer_ID no longer work
    2. customer_ID is the same as the company party_ID
      • Easier to determine aggregate sales for each company, etc
      • This would confuse matters as it would appear that the company is its own customer, rather than other unique customers
  2. Generate a unique customer_ID for every new anonymous customer (per session)
    • What if the same physical user returns? There will be many records repeating the same sort of data; email, shipping address, etc.
  3. Use another unique key, such as email address, to refer to a customer
    • Not always reliable as people sometimes use more than one email address, or leave old addresses behind.
    • What if there is no email address to be taken, as is the case on the shop floor, pro forma invoices, etc?
  4. Some other Stack Overflow inspired solution!

Addition

A combination of #2 and #3 has been suggested elsewhere - attempt to store a single record for each customer, using the email address if possible, or a new record on every visit if not.

I should point out that we don't need to store a record for every anonymous customer, but it just seems that the relational database was built to deal with relationships, so having a NULL or a customer_ID in the transaction table that doesn't reference an actual customer record just seems wrong...

I must also stress that the purpose of this question is to determine what real-world solutions there are to recording 'casual' transactions where no postal address or email address are given (imagine a supermarket chekout) alongside online shop transactions where an email address and postal address are given whether they are stored or not.

What solutions have the SO community used in the past?

Assuming you require an e-mail address for all online orders, you could create a temporary account for every customer at the completion of each order when they are not logged in.

This can be done by using the shipping address and other information provided during checkout to fill in the account, and e-mailing a random temporary password to them (optionally flagging it to require changing on the first log-in, if that functionality is built into the website). This requires minimal effort on their part to setup the account, and allows them to sign in to check their order status.

Since the primary key in your database is the customer_id, it should not cause conflicts if they continue making new accounts with the same e-mail/address/etc, unless you have code in place to prevent duplicates already. It's rare for someone to create more than one temporary account though, since it's easier to log in with the password e-mailed to them than entering their data again.

For the backend orders, we generally create an account in the same way as above for every customer. However, if they don't have an e-mail address (or they only want to purchase by phone), we generate an account with their shipping information and a blank e-mail address (have to code an exception to not send temporary passwords/order confirmations when it's blank). The customer_id is given to them, and their shipping information and company name are stored in the account to look up and expedite future orders.

Are identical primary keys bad practice?

7 votes

I am trying to create a site where users can register and create a profile, therefore I am using two MySQL tables within a database e.g. users and user_profile.

The users table has an auto increment primary key called user_id.

The user_profile table has the same primary key called user_id however it is not auto increment.

*see note for why I have multiple tables.

When a user signs up, data from the registration form is inserted into users, then the last_insert_id() is input into the user_id field of the user_profile table. I use transactions to ensure this always happens.

My question is, is this bad practice?

Should I have a unique auto increment primary key for the user_profile table, even though one user can only ever have one profile?

Maybe there are other downsides to creating a database like this?

I'd appreciate if anyone can explain why this is a problem or if it's fine, I'd like to make sure my database is as efficient as possible.

Note: I am using seperate tables for user and user_profile because user_profile contains fields that are potentially null and also will be requested much more than the user table, due to the data being displayed on a public profile.

Maybe this is also bad practice and they should be lumped in one table?

I find this a good approach, I'd give bonus point if you use a foreign key relation and preferably cascade when deleting the user from the user table.

As too separated the core user data in one table, and the option profile data in another - good job. Nothing more annoying then a 50 field dragonish entry with 90% empty values.

Pros/cons of MongoDB or MySQL for this purpose

7 votes

I'm looking for a bit of help or guidance on which database to use for a project. If you can raise any points, or note flaws, answer any questions or promote either database type for the purpose I'm about to spell out, I would really appreciate it.

Anyways:

  • We have some software that tracks forms.

  • We have users that can have MANY different properties, literally hundreds of settings, and I'm not a fan of MySQL tables that wide. I really like Mongo for this.

  • We have different types of forms, each can have completely different fields. Right now, we have a list of forms with generic data, then join the relevant table for additional data. I would have all of these fields in one distinct document with Mongo, and I could easily add fields without worrying.

  • We have fees, notes, history on each form. I like how in MySQL they are in a different table, and I can get history by form or by user - same as notes.

  • Our policy is pretty much keep ALL data, even deleted or pre-edited data... forever. Should I be worried about hitting a size limit? We're probably talking 100gb by the end of 2013

  • How many Mongo queries per page will bog things down? 20? 100? Would that change if I had a SSD in the server? (Right now, we have about 60 MySQL queries a page. This can be improved on.)

  • Is it a bad idea for my first Mongo project to be a somewhat major bit of software? Is it something I can learn as I go?

  • I like the case insensitivity of MySQL column names for quick and dirty things.

  • In MySQL, I break things out to different tables. Is it fine, in Mongo, to put data together that CAN be separated? Example: username, email, phone, license1 => [num,isValid], license2 => [num, isValid], notifications => [notification1...notification50000], password hash, salt, setting1, setting2...setting1000, permission1, permission2...permission1000 Of course, I'd make use of the nested style to organize, but is it best to store all this under "user" or break it out to settings, licenses, permissions? Second example: formName, address, notes => [note1 => [user,note,date], note2 => [user,note,date]]

  • Is there any problems with doing a HYBRID setup, where user data is is Mongo, and form data is in MySQL?

  • We have to run a lot of reports, are there limitations on this in Mongo? For example, would I run into problems looking for every form from the past 40 days with a fee over $10, with the fees in each row totaled up, sorted by the age of the user who filled it out?

  • Data redundancy - On the Amazon cloud, MySQL has MASSIVE amounts of redundancy. Is there any service to match that with Mongo? Is it complex to get into setting that up on my own?

  • Is MongoDB supported by any "cloud" providers? AWS does a lot for MySQL, but it looks like I'd be on my own for Mongo

Just a few things off the top of my head - I really do appreciate anything anyone has to say.

We have users that can have MANY different properties, literally hundreds of settings, and I'm not a fan of MySQL tables that wide. I really like Mongo for this.

We have different types of forms, each can have completely different fields. Right now, we have a list of forms with generic data, then join the relevant table for additional data. I would have all of these fields in one distinct document with Mongo, and I could easily add fields without worrying.

From your post i understand that your ultimate aim is to handle the users & forms that contains varying schema(aka schemaless). I believe mongodb is a right choice for this purpose.

We have fees, notes, history on each form. I like how in MySQL they are in a different table, and I can get history by form or by user - same as notes.

No problem, You can use different documents (or embedded documents based on the size of it - 16 mb is the max size of the doc) to handle this without any problems. so you can have the schema like

  Form
   - form field1
   - form field1
   - id of the fees doc
   - id of the notes doc
   - id of the history doc

or (for embedded docs)

  Form
   - form field1
   - form field2
   - embedded fees doc
             - fees field1 
             - fees field2
   - embedded notes doc
             - notes field1 
             - notes field2

Our policy is pretty much keep ALL data, even deleted or pre-edited data... forever. >Should I be worried about hitting a size limit? We're probably talking 100gb by the end of >2013

You will store as much as data you would do, already there are production deployments storing data over Terabytes.

Is it a bad idea for my first Mongo project to be a somewhat major bit of software? Is it something I can learn as I go?

Yes if you are going to use mongodb without prototyping your application model. i would recommend to implement (prototype) a minimal set of your app (like features that sucks in mysql) and learn basics and see how comfortable you are.

I like the case insensitivity of MySQL column names for quick and dirty things.

Mongo enforces the case sensitivity, because thats a nature of BSON (as well JSON) key value pairs.

In MySQL, I break things out to different tables. Is it fine, in Mongo, to put data together that CAN be separated? Example: username, email, phone, license1 => [num,isValid],

Main advantage of mongo over other sql data store is, you can store as much of relevant info within the same document (within the 16 mb size) . If you are unsure about the size or certain parts of data are growing, then you can split the part into another. Since you are concern about the no of queries, it will drastically reduce the number of requests.

Is there any problems with doing a HYBRID setup, where user data is is Mongo, and form data is in MySQL?

No absolutely not, in fact i am currently running mongodb along with mysql(for transactions alone). But if you are not handling any transactions, you can stick with mongodb.

We have to run a lot of reports, are there limitations on this in Mongo? For example, would I run into problems looking for every form from the past 40 days with a fee over $10, with the fees in each row totaled up, sorted by the age of the user who filled it out?

No i don't see any limitation in this. In fact its very fast handling queries with the proper indexes. But there are certain things you can't do with mongo like normal joins, instead you can use map/reduce to handle the data for reports.

Is MongoDB supported by any "cloud" providers? AWS does a lot for MySQL, but it looks like I'd be on my own for Mongo

Mongohq,Mongolab are some of the dedicated managed mongo hosting services available. Also redhat openshift & vmware cloundfoundry provides the hosting platforms for mongo, you can check out the mongo hosting center for more info

Hope this helps

Cheers

Are MD5-hashed passwords from PHP's crypt() portable to the Django password field?

5 votes

I'm porting a bunch of user accounts from a legacy PHP website to a new and shiny Django-based site. A bunch of the passwords are stored as the MD5 hash output from PHP's crypt() function (see the third example there).

Given this password hash from the legacy application:

$1$f1KtBi.v$nWwBN8CP3igfC3Emo0OB8/

How might I convert it to the Django form of md5$<salt>$<hash>? The crypt() MD5 output seems to use a different alphabet than Django's MD5 support (which appears to be using a hexdigest).

Update:

There's a similar (and unanswered) question with an interesting potential solution to convert the PHP hash to a base-16 encoding, but based on some initial poking, it doesn't seem to produce a usable MD5 hexdigest. :(

Concrete example:

A concrete example might help.

Given:

  • a password of foo
  • a salt of $1$aofigrjlh

In PHP, crypt('foo', '$1$aofigrjlh') produces a hash of $1$aofigrjl$xLnO.D8x064D1kDUKWwbX..

crypt() is operating in MD5 mode, but it's some wacky Danish translation of the MD5 algorithm (Update: It's MD5-Crypt). Since Python is a Dutch-derived language, Python's crypt module only supports the DES-style of hashing.

In Python, I need to be able to reproduce that hash, or some regular derivation of it, given the original password and salt.

Unfortunately, it isn't possible to convert those over to Django's format (though there is a possible route you can take that will get your hashes imported, detailed below).

Django's salted md5 algorithm uses a very simple algorithm: md5(salt + password), which is then encoded to hexidecimal.

On the other hand, the hashes output by PHP's crypt() which begin with $1$ are not simple md5 hashes. Instead, they use a password hashing algorithm known as MD5-Crypt. This is much more complex (and secure) than a simple md5 hash. There's a section in the linked page which describes the MD5-Crypt format & algorithm. There is no way to translate it into Django's format, as it doesn't offer support for the algorithm within it's code.

While Django does have code which called Python's stdlib crypt() function, the way Django mangles the hashes means there's no easy way to get a hash beginning with $1$ all the way through Django and into crypt(); and that's the only way to signal to crypt() that you want to use MD5-Crypt instead of the older DES-Crypt.


However, there is a possible route: you can monkeypatch django.contrib.auth.models.User so that it supports both the normal Django hashes, as well as the MD5-Crypt format. That way you can import the hashes unchanged. One way is to do this manually, by overriding the User.set_password and User.check_password methods.

Another alternative is to use the Passlib library, which contains a Django app that was designed to take care of all this, as well as provide cross-platform support for md5-crypt et al. (Disclaimer: I'm the author of that library). Unfortunately that Django plugin is undocumented, because I haven't tested it much outside of my own django deploys... though it works fine for them :) (There is some beta documentation in the source).

In order to use it, install passlib, and add passlib.ext.django to your list of installed apps. Then, within settings.py, add the following:

PASSLIB_CONTEXT = """
[passlib]
schemes =
    md5_crypt,
    django_salted_sha1, django_salted_md5,
    django_des_crypt, hex_md5,
    django_disabled

default = md5_crypt

deprecated = django_des_crypt, hex_md5
"""

This will override User.set_password and User.check_password to use Passlib instead of the builtin code. The configuration string above configures passlib to mimic Django's builtin hashes, but then adds support for md5_crypt, so your hashes should then be accepted as-is.

How to join multiple tables related by other tables

5 votes

I'm developing a site where people can publicate their houses for rent. I'm using php 5.2.0 and MySQL 5+

the publications are stored in a table like this

ta_publications
+---+-------------+------+
|id |    name     | date |
+---+-------------+------+
| 1 | name_001    |  ... |
| 2 | name_002    |  ... |
| 3 | name_003    |  ... |
+---+-------------+------+

I have diferent publications, which have "features" such as "internet", "made service", "satellite tv", etc.

These features might change in the future, and I want to be able to add/remove/modify them, so I store them in the database in a table.

ta_features
+---+-------------+
|id | name        |
+---+-------------+
| 1 | Internet    |
| 2 | Wi-Fi       |
| 3 | satelital tv|
+---+-------------+

which are related to the publications using the following table

ta_publication_features
+---+-------------+----------------+
|id |   type_id   | publication_id |
+---+-------------+----------------+
| 1 |      1      |       1        |
| 2 |      2      |       1        |
| 3 |      3      |       1        |
+---+-------------+----------------+

I think it's pretty easy to understand; There is a publication called name_001 which have internet, wi-fi and satellite tv.

I have the same data-schema for the images, I store them in this table

ta_images
+---+-------------+
|id | src         |
+---+-------------+
| 1 | URL_1       |
| 2 | URL_2       |
| 3 | URL_3       |
+---+-------------+

And use the following table to relate them to the publications

ta_publication_images
+---+-------------+----------------+----------+
|id |  img_id     | publication_id |   order  |
+---+-------------+----------------+----------+
| 1 |      1      |       1        |    0     |
| 2 |      2      |       1        |    1     |
| 3 |      3      |       1        |    2     |
+---+-------------+----------------+----------+

the column order gives the order in wich publications should be displayed when listing a single publication.

Philipp Reichart provided me with a query that will search and get all the publications that have certain features. It works for listing the publications, I can't modified it to return me the data I need.

So I figured I'll run that query and get all of the publications that pass the search criteria and then use another query to list them.

The listing of these publications shall include all publication data (everything on ta_publications)+ all of it's features + the most important (order 0) image src.

I could, for every publication, have two simple querys wich will return, separately, the most important image and all the features it has, but when listing 25 publications per page, it'll be 1 search query + (2 querys per publication * 25 publications) = 51 different querys, clearly not very efficient.

EDIT:

My question is, how can I create a SQL query that, given some publication ids, will return: all publication data (everything on ta_publications) + all of it's features + the most important (order 0) image src

You'll get redundant publication and image data with this one, but here is a way to do it with one query:

   SELECT p.id, p.name, p.date,
           f.id, f.name,
           i.id, i.src
    FROM ta_publications p
    JOIN ta_publication_features pf ON p.id = pf.publication_id
    JOIN ta_features f ON f.id = pf.type_id
    JOIN ta_publication_images pi ON p.id = pi.publication_id 
         AND pi.order = 0
    JOIN ta_images i ON i.id = pi.img_id
    WHERE p.id IN (  -- list of publication ids );

Parse Apache log in PHP using preg_match

5 votes

I need to save data in a table (for reporting, stats etc...) so a user can search by time, user agent etc. I have a script that runs every day that reads the Apache Log and then insert it in the database.

Log format:

10.1.1.150 - - [29/September/2011:14:21:49 -0400] "GET /info/ HTTP/1.1" 200 9955 "http://www.domain.com/download/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1"

My regex:

preg_match('/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] \"(\S+) (.*?) (\S+)\" (\S+) (\S+) (\".*?\") (\".*?\")$/',$log, $matches);

Now when I print:

print_r($matches);

Array
(
    [0] => 10.1.1.150 - - [29/September/2011:14:21:49 -0400] "GET /info/ HTTP/1.1" 200 9955 "http://www.domain.com/download/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1"
    [1] => 10.1.1.150
    [2] => -
    [3] => -
    [4] => 29/September/2011
    [5] => 14:21:49
    [6] => -0400
    [7] => GET
    [8] => /info/
    [9] => HTTP/1.1
    [10] => 200
    [11] => 9955
    [12] => "http://www.domain.com/download/"
    [13] => "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1"
)

I get: "http://www.domain.com/download/" and same for user agent. How can I get rid of these " in the regex? Bonus (Is there any quick way to insert the date/time easily)?

Thanks

You could change the regex to:

preg_match('/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] \"(\S+) (.*?) (\S+)\" (\S+) (\S+) "([^"]*)" "([^"]*)"$/',$log, $matches);

I'm not sure what you meant by 'quick way to insert the date/time'; can you clarify that more for me?