Best string questions in March 2012

What makes reference comparison (==) work for some strings in Java?

35 votes

I have the following lines of code to compare Strings. str1 is not equal to str2, which is understandable since == compares object references. But then why is s1 equal to s2?

String s1 = "abc";
String s2 = "abc";

String str1 = new String("abc");
String str2 = new String("abc");

if (s1==s2)
    System.out.println("s1==s2");           
else
    System.out.println("s1!=s2");

if (str1==str2)
    System.out.println("str1==str2");           
else
    System.out.println("str1!=str2");

if (s1==str1)
    System.out.println("str1==s1");         
else
    System.out.println("str1!=s1");

Output:

  s1==s2
  str1!=str2
  str1!=s1 

The string constant pool will essentially cache all string literals so they're the same object underneath, which is why you see the output you do for s1==s2. It's essentially an optimisation in the VM to avoid creating a new string object each time a literal is declared, which could get very expensive very quickly! With your str1==str2 example, you're explicitly telling the VM to create new string objects, hence why it's false.

As an aside, calling the intern() method on any string will add it to the constant pool if an equal string is not already there, and return the pooled String. It's not necessarily a good idea to do this, however, unless you're sure you're dealing with strings that will definitely be used as constants; otherwise you may end up creating hard-to-track-down memory leaks.
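To make the pooling behaviour concrete, here is a minimal sketch. The class name InternDemo and the helper sameObject are made up for illustration; only String.intern() and String.equals() come from the JDK.

```java
public class InternDemo {
    // Hypothetical helper: true only when both arguments are the very same object.
    public static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String literal = "abc";             // taken from the string constant pool
        String fresh = new String("abc");   // explicitly a new, separate object
        String interned = fresh.intern();   // the pooled instance with the same contents

        System.out.println(sameObject(literal, fresh));    // false: different objects
        System.out.println(sameObject(literal, interned)); // true: both are the pooled instance
        System.out.println(literal.equals(fresh));         // true: same characters
    }
}
```

In everyday code, equals() is the comparison you want; == only tells you whether two references happen to point at the same object.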

Why are strings immutable in many programming languages?

20 votes

Several languages have made this choice, such as C#, Java, and Python. If it is intended to save memory or gain efficiency for operations like comparison, what effect does it have on concatenation and other modifying operations?

Immutable types are a good thing generally:

  • They work better for concurrency (you don't need to lock something that can't change!)
  • They reduce errors: mutable objects are vulnerable to being changed when you don't expect it which can introduce all kinds of strange bugs ("action at a distance")
  • They can be safely shared (i.e. multiple references to the same object) which can reduce memory consumption and improve cache utilisation.
  • Sharing also makes copying a very cheap O(1) operation when it would be O(n) if you have to take a defensive copy of a mutable object. This is a big deal because copying is an incredibly common operation (e.g. whenever you want to pass parameters around....)

As a result, it's a pretty reasonable language design choice to make strings immutable.

Some languages (particularly functional languages like Haskell and Clojure) go even further and make pretty much everything immutable. This enlightening video is very much worth a look if you are interested in the benefits of immutability.

There are a couple of minor downsides for immutable types:

  • Operations that create a changed string like concatenation are more expensive because you need to construct new objects. Typically the cost is O(n+m) for concatenating two immutable Strings, though it can go as low as O(log (m+n)) if you use a tree-based string data structure like a Rope. Plus you can always use special tools like Java's StringBuilder if you really need to concatenate Strings efficiently.
  • A small change on a large string can result in the need to construct a completely new copy of the large String, which obviously increases memory consumption. Note however that this isn't usually a big issue in garbage-collected languages since the old copy will get garbage collected pretty quickly if you don't keep a reference to it.

Overall though, the advantages of immutability vastly outweigh the minor disadvantages. Even if you are only interested in performance, the concurrency advantages and cheapness of copying will in general make immutable strings much more performant than mutable ones with locking and defensive copying.
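As a rough illustration of the concatenation cost and the StringBuilder workaround mentioned above, here is a small Java sketch. The class and method names are invented for this example; it shows the pattern rather than measuring anything.

```java
public class ConcatDemo {
    // Repeated += builds a brand-new String on every iteration,
    // copying everything accumulated so far.
    public static String joinWithPlus(String[] parts) {
        String result = "";
        for (String part : parts) {
            result += part;
        }
        return result;
    }

    // StringBuilder appends into a mutable buffer and creates
    // a single immutable String at the end.
    public static String joinWithBuilder(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] parts = {"im", "mut", "able"};
        System.out.println(joinWithPlus(parts));
        System.out.println(joinWithBuilder(parts));
    }
}
```

Both methods return the same text; the difference only matters when many pieces are joined, where the first version does quadratic copying overall.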

String.Concat inefficient code?

18 votes

I was investigating String.Concat in Reflector and found something very strange:

They already have the values array, yet they create a NEW ARRAY, which they later pass to ConcatArray.

Question:

Why did they create a new array? They had values in the first place...

Well, for one thing, it means that the contents of the new array can be trusted to be non-null... and unchanging.

Without that copying, another thread could modify the original array during the call to ConcatArray, which presumably could throw an exception or even trigger a security bug. With the copying, the input array can be changed at any time - each element will be read exactly once, so there can be no inconsistency. (The result may be a mixture of old and new elements, but you won't end up with memory corruption.)

Suppose ConcatArray is trusted to do bulk copying out of the strings in the array it's passed, without checking for buffer overflow. Then if you change the input array at just the right time, you could end up writing outside the allocated memory. Badness. With this defensive copy, the system can be sure [1] that the total length really is the total length.

[1] Well, unless reflection is used to change the contents of a string. But that can't be done without fairly high permissions - whereas changing the contents of an array is easy.
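The defensive-copy pattern can be sketched in Java as follows. This is not the .NET implementation, and the class DefensiveConcat is hypothetical. Java's bounds checks would also rule out the memory corruption described above, so the sketch only illustrates the "read each element exactly once" idea.

```java
public class DefensiveConcat {
    public static String concat(String... values) {
        // Snapshot the caller's array. Each element is read exactly once,
        // and nulls are replaced, so the copy is non-null and private to us.
        String[] copy = new String[values.length];
        int totalLength = 0;
        for (int i = 0; i < values.length; i++) {
            String s = values[i];
            copy[i] = (s == null) ? "" : s;
            totalLength += copy[i].length();
        }

        // From here on only the private copy is used, so totalLength
        // stays accurate even if another thread modifies 'values'.
        StringBuilder sb = new StringBuilder(totalLength);
        for (String s : copy) {
            sb.append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(concat("a", null, "b")); // prints "ab"
    }
}
```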

Can anyone explain this bizarre JS behavior concerning string concatenation?

10 votes

I just posted this to a gist: https://gist.github.com/2228570

var out = '';

function doWhat(){
    out += '<li>';
    console.log(out === '<li>'); // at this point, out will equal '<li>'
    return '';
}

out += doWhat();
console.log(out, out === '<li>');
// I expect out to == '<li>', but it's actually an empty string!?

This behavior is odd, does anyone have an explanation? This is a tough thing to google. It also makes no difference if you use out += or out = out +.

EDIT: @paislee made a JSFiddle that demonstrates how if doWhat is on a separate line, it behaves as expected: http://jsfiddle.net/paislee/Y4WE8/

It seems you're expecting doWhat to be called before the += is evaluated.

But, the progression of the line is:

out += doWhat();      // original line
out = out + doWhat(); // expand `+=`
out = '' + doWhat();  // evaluate `out`, which is currently an empty string
out = '' + '';        // call `doWhat`, which returns another empty string
out = '';             // result

The out += '<li>'; inside doWhat is updating the variable, but too late to have a lasting effect.
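Java evaluates compound assignment the same way (the old value of the left-hand side is saved before the right-hand side runs), so the effect and a fix can be sketched there too. The class EvalOrderDemo and its methods are made up for this illustration.

```java
public class EvalOrderDemo {
    static String out = "";

    static String doWhat() {
        out += "<li>"; // updates the field as a side effect
        return "";
    }

    // Mirrors the original line: the old value "" of out is read first,
    // then doWhat() runs, then "" + "" overwrites the side effect.
    public static String surprising() {
        out = "";
        out += doWhat();
        return out; // ""
    }

    // Calling doWhat() on its own line lets the side effect land
    // before out is read for the concatenation.
    public static String expected() {
        out = "";
        String piece = doWhat();
        out += piece;
        return out; // "<li>"
    }

    public static void main(String[] args) {
        System.out.println("[" + surprising() + "]"); // prints "[]"
        System.out.println("[" + expected() + "]");   // prints "[<li>]"
    }
}
```

This matches the JSFiddle observation in the question: putting the call on a separate line changes when out is read.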

How to prevent user from reading strings stored in stack?

6 votes

Here's a minimal test case:

#include <stdio.h>
#include <stdlib.h>

int main ( int argc , char **argv )
{
        const char abc [15] = "abcdefg\0";
        printf ("%s\n" , abc);
        return 0;
}

If you run strings test, you should see abcdefg, as it's stored in a read-only area.

So what's the best way to prevent a user from reading this string with the strings command? For example, I don't want users to know my SQL query.

One solution would be to write an additional program that runs as another user and reads the credentials from a location that is not accessible to the users you want to protect them from. This program would expose an API (through TCP/IP, or any message-passing or remote procedure call interface), so that your main program does not need to connect to the database directly; the helper responds only to the requests you're interested in.

Another approach is to set the setuid bit on your program, and read credentials from a location where users have no read access. Give the program an owner that is allowed to read the file containing the query, using chown. When executed, your program will obtain privileges to read the file.

As said in Nawaz's answer (and Binyamin Sharet's), you could use obfuscation techniques to make the query harder to read (in particular, the strings command would no longer find it), but keep in mind that someone with more knowledge will be able to find the string using a disassembler or a debugger, or simply by running your program under strace. This makes the approach unsuitable for storing sensitive information such as connection credentials: as long as a binary can connect, it contains the credentials. Anyone with some knowledge of computer security knows that and may reverse engineer your program to retrieve your password.

As a general guideline, if you need to protect information from a user executing your program, never giving this information to the program is the only way to make sure it can't be read.
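The "read it from a protected location at run time" idea might look like the sketch below, written in Java rather than C. QueryLoader and its usage message are hypothetical. The protection comes entirely from the file's ownership and permissions, which have to be set up outside the program as described above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class QueryLoader {
    // Reads the query text from a file instead of embedding it in the binary,
    // so tools like strings find nothing sensitive in the program itself.
    public static String loadQuery(Path queryFile) throws IOException {
        byte[] raw = Files.readAllBytes(queryFile);
        return new String(raw, StandardCharsets.UTF_8).trim();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the deployer passes the path of a file that only
        // the program's owner is allowed to read.
        if (args.length != 1) {
            System.err.println("usage: QueryLoader <query-file>");
            return;
        }
        String query = loadQuery(Paths.get(args[0]));
        System.out.println("Loaded a query of " + query.length() + " characters");
    }
}
```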

Why does String.split need pipe delimiter to be escaped?

5 votes

I am trying to parse a file in which each line holds pipe-delimited values. It did not work correctly when I did not escape the pipe delimiter in the split method; it worked correctly after I escaped the pipe as below.

private ArrayList<String> parseLine(String line) {
    ArrayList<String> list = new ArrayList<String>();
    String[] list_str = line.split("\\|");
    System.out.println(list_str.length);
    System.out.println(line);
    for(String s:list_str) {
        list.add(s);
        System.out.print(s+ "|");
    }
    System.out.println();
    //System.out.println("line: " +line);
    //System.out.println("splt: " +list);
    return list;
}

Can someone explain why?

String.split expects a regular expression argument. An unescaped | is parsed as a regex meaning "empty string or empty string," which isn't what you mean.
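A short sketch of the options: escape the pipe by hand, or let Pattern.quote do it for you. SplitDemo and its method names are invented for this example; String.split and Pattern.quote are standard JDK calls.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitDemo {
    // "\\|" is the regex \| : a literal pipe character.
    public static String[] splitEscaped(String line) {
        return line.split("\\|");
    }

    // Pattern.quote turns any string into a regex that matches it literally.
    public static String[] splitQuoted(String line) {
        return line.split(Pattern.quote("|"));
    }

    // "|" alone is an alternation of two empty patterns, which matches
    // the empty string and therefore does not split at the pipes.
    public static String[] splitUnescaped(String line) {
        return line.split("|");
    }

    public static void main(String[] args) {
        String line = "a|b|c";
        System.out.println(Arrays.toString(splitEscaped(line)));   // [a, b, c]
        System.out.println(Arrays.toString(splitQuoted(line)));    // [a, b, c]
        System.out.println(Arrays.toString(splitUnescaped(line))); // not the three fields
    }
}
```

Pattern.quote is handy when the delimiter comes from configuration and you can't know in advance whether it contains regex metacharacters.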