Best perl questions in March 2012

Seeking clarification on apparent contradictions regarding weakly typed languages

52 votes

I think I understand strong typing, but every time I look for examples for what is weak typing I end up finding examples of programming languages that simply coerce/convert types automatically.

For instance, an article named Typing: Strong vs. Weak, Static vs. Dynamic says that Python is strongly typed because you get an exception if you try this:

Python

1 + "1"
Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: unsupported operand type(s) for +: 'int' and 'str'

However, such a thing is possible in Java and in C#, and we do not consider them weakly typed just for that.

Java

  int a = 10;
  String b = "b";
  String result = a + b;
  System.out.println(result);

C#

int a = 10;
string b = "b";
string c = a + b;
Console.WriteLine(c);

In another article, named Weakly Type Languages, the author says that Perl is weakly typed simply because I can concatenate a string to a number and vice versa without any explicit conversion.

Perl

$a=10;
$b="a";
$c=$a.$b;
print $c; #10a

So the same example makes Perl weakly typed, but not Java and C#?

Gee, this is confusing.

The authors seem to imply that a language that prevents the application of certain operations on values of different types is strongly typed and the contrary means weakly typed.

Therefore, at some point I felt prompted to believe that a language that provides a lot of automatic conversions or coercions between types (as Perl does) may end up being considered weakly typed, whereas languages that provide only a few conversions may end up being considered strongly typed.

I am inclined to believe, though, that I must be wrong in this interpretation; I just do not know why or how to explain it.

So, my questions are:

  • What does it really mean for a language to be truly weakly typed?
  • Could you mention any good examples of weak typing that are not related to automatic conversion/automatic coercion done by the language?
  • Can a language be weakly typed and strongly typed at the same time?

Thanks in advance for any references, use cases or examples that can lead me in the right direction.

What does it really mean for a language to be "weakly typed"?

It means "this language uses a type system that I find distasteful". A "strongly typed" language by contrast is a language with a type system that I find pleasant.

The terms are essentially meaningless and you should avoid them. Wikipedia lists eleven different meanings for "strongly typed", several of which are contradictory. This indicates that the odds of confusion being created are high in any conversation involving the term "strongly typed" or "weakly typed".

All that you can really say with any certainty is that a "strongly typed" language under discussion has some additional restriction in the type system, either at runtime or compile time, that a "weakly typed" language under discussion lacks. What that restriction might be cannot be determined without further context.

Instead of using "strongly typed" and "weakly typed", you should describe in detail what kind of type safety you mean. For example, C# is a statically typed language and a type safe language and a memory safe language, for the most part. C# allows all three of those forms of "strong" typing to be violated. The cast operator violates static typing; it says to the compiler "I know more about the runtime type of this expression than you do". If the developer is wrong, then the runtime will throw an exception in order to protect type safety. If the developer wishes to break type safety or memory safety, they can do so by turning off the type safety system by making an "unsafe" block. In an unsafe block you can use pointer magic to treat an int as a float (violating type safety) or to write to memory you do not own (violating memory safety).

C# imposes type restrictions that are checked at both compile time and runtime, thereby making it a "strongly typed" language compared to languages that do less compile-time checking or less runtime checking. C# also allows you, in special circumstances, to do an end-run around those restrictions, making it a "weakly typed" language compared with languages which do not allow such an end-run.

Which is it really? It is impossible to say; it depends on the point of view of the speaker and their attitude towards the various language features.
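To make the point concrete, here is a small Python 3 sketch (not part of the original answer) in which one language lands on both sides of the "strong"/"weak" line, depending on which restriction you look at. The helper name try_add is made up for illustration, and the struct call is only a loose analogue of the C# unsafe-block trick.

```python
import struct

def try_add(a, b):
    """Return a + b, or a description of the TypeError Python raises."""
    try:
        return a + b
    except TypeError as exc:
        return "TypeError: " + str(exc)

# Python refuses to mix int and str ("strong" by the first article's test)...
print(try_add(1, "1"))
# ...yet it converts int to float and bool to int without being asked.
print(try_add(1, 1.5))    # 2.5
print(try_add(True, 1))   # 2

# Reinterpreting the bits of an int as a 32-bit float. In Python this is an
# explicit library call rather than pointer magic, so nothing is "violated".
bits = 0x3F800000
as_float = struct.unpack("<f", struct.pack("<I", bits))[0]
print(as_float)           # 1.0
```

Which of these behaviours counts as "weak" is exactly the kind of question the two labels cannot settle on their own.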

Decrypt obfuscated perl script

12 votes

I had some spam issues on my server and, after finding and removing some Perl and PHP scripts, I am down to checking what they really do. Although I'm a senior PHP programmer, I have little experience with Perl. Can anyone give me a hand with this script:

http://pastebin.com/MKiN8ifp

(It was one long line of code, script was called list.pl)


The start of the script is:

$??s:;s:s;;$?::s;(.*); ]="&\%[=.*.,-))'-,-#-*.).<.'.+-<-~-#,~-.-,.+,~-{-,.<'`.{'`'<-<--):)++,+#,-.{).+,,~+{+,,<)..})<.{.)-,.+.,.)-#):)++,+#,-.{).+,,~+{+,,<)..})<*{.}'`'<-<--):)++,+#,-.{).+:,+,+,',~+*+~+~+{+<+,)..})<'`'<.{'`'<'<-}.<)'+'.:*}.*.'-|-<.+):)~*{)~)|)++,+#,-.{).+:,+,+,',~+*+~+~+{+<+,)..})

It continues with precious few non-punctuation characters until the very end:

0-9\;\\_rs}&a-h;;s;(.*);$_;see;

Replace the s;(.*);$_;see; with print to get this. Replace s;(.*);$_;see; again with print in the first half of the payload to get this, which is the decryption code. The second half of the payload is the code to decrypt, but I can't go any further with it, because as you see, the decryption code is looking for a key in an envvar or a cookie (so that only the script's creator can control it or decode it, presumably), and I don't have that key. This is actually reasonably cleverly done.
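The "swap the final eval for a print" move can be sketched in Python 3 on a harmless toy payload. Everything here (the payload, the base64 layering, the names wrap, peel and reveal) is invented for illustration and has nothing to do with how list.pl actually encodes itself.

```python
import base64

def wrap(source):
    """Hide source behind one layer: a stub that decodes it and exec()s it."""
    blob = base64.b64encode(source.encode("utf-8")).decode("ascii")
    return "import base64; exec(base64.b64decode(%r).decode('utf-8'))" % blob

def peel(stub):
    """Reveal the next layer by capturing what the stub would have executed.

    The decoding step of the stub still runs, so this is only as safe as
    that step; inspect it first when dealing with real malware.
    """
    captured = []
    revealing = stub.replace("exec(", "reveal(", 1)
    exec(revealing, {"reveal": captured.append})
    return captured[0]

inner = "print('payload ran')"
outer = wrap(wrap(inner))       # two layers, so two replacements are needed
print(peel(peel(outer)))        # shows the inner source instead of running it
```

As in the Perl case, each peel exposes only one layer; a layer that needs an external key (the environment variable or cookie mentioned above) cannot be peeled this way.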

How to handle utf8 on the command line (using Perl or Python)?

12 votes

How can I handle utf8 using Perl (or Python) on the command line?

I am trying to split the characters in each word, for example. This is very easy for non-utf8 text, for example:

$ echo "abc def" | perl -ne 'my @letters = m/(.)/g; print "@letters\n"' | less
a b c   d e f

But with utf8 it doesn't work, of course:

$ echo "одобрение за" | perl -ne 'my @letters = m/(.)/g; print "@letters\n"' | less
<D0> <BE> <D0> <B4> <D0> <BE> <D0> <B1> <D1> <80> <D0> <B5> <D0> <BD> <D0> <B8> <D0> <B5>   <D0> <B7> <D0> <B0>

because it doesn't know about the 2-byte characters.

It would also be good to know how this (i.e., command-line processing of utf8) is done in Python.

The "-C" flag controls some of the Perl Unicode features (see perldoc perlrun):

$ echo "одобрение за" | perl -C -pe 's/.\K/ /g'
о д о б р е н и е   з а 

In Python, you can specify the encoding used for stdin/stdout with the PYTHONIOENCODING environment variable (the example below is Python 2):

$ echo "одобрение за" | PYTHONIOENCODING=utf-8 python -c'import sys
for line in sys.stdin:
    print " ".join(line.decode(sys.stdin.encoding)),
'
о д о б р е н и е   з а 
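For Python 3, a minimal sketch of the same idea is to decode the byte stream explicitly and then join the characters of each line. The function name spaced_lines is hypothetical; on the command line you would pass it sys.stdin.buffer, while an in-memory stream is used below so the example runs without piped input.

```python
import io

def spaced_lines(byte_stream, encoding="utf-8"):
    """Decode a binary stream and yield each line with its code points space-separated."""
    text_stream = io.TextIOWrapper(byte_stream, encoding=encoding)
    for line in text_stream:
        yield " ".join(line.rstrip("\n"))

# Stand-in for piped input: the same Cyrillic text used in the question.
sample = io.BytesIO("одобрение за\n".encode("utf-8"))
result = list(spaced_lines(sample))
print(len(result), "line(s) processed")
```

Decoding explicitly avoids depending on the locale of the terminal, which is the same job the -C flag does for Perl.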

If you'd like to split the text on grapheme cluster boundaries (user-perceived characters) rather than on code points, as the code above does, you can use the \X regular expression:

$ echo "одобрение за" | perl -C -pe 's/\X\K/ /g'
о д о б р е н и е   з а 

See Grapheme Cluster Boundaries

In Python, \X is supported by the third-party regex module.
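If installing a third-party module is not an option, a rough approximation with only the standard library is to attach combining marks to the preceding character. This is a sketch under that simplifying assumption, not an implementation of the full grapheme cluster rules, and the name rough_clusters is made up.

```python
import unicodedata

def rough_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # combining mark: glue to previous cluster
        else:
            clusters.append(ch)     # anything else starts a new cluster
    return clusters

word = "e\u0301te\u0301"            # "ete" with two combining acute accents
print(len(word))                    # 5 code points
print(len(rough_clusters(word)))    # 3 user-perceived characters
```

It will get emoji sequences and some scripts wrong, which is where \X earns its keep.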

What is Python's equivalent of "perl -V"

11 votes

The output produced by running perl -V is packed with useful information (see example below). Is there anything like it for Python?


Example output:

% perl -V
Summary of my perl5 (revision 5 version 10 subversion 1) configuration:

  Platform:
    osname=linux, osvers=2.6.32-5-amd64, archname=x86_64-linux-gnu-thread-multi
    uname='linux brahms 2.6.32-5-amd64 #1 smp tue jun 14 09:42:28 utc 2011 x86_64 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.1 -Dsitearch=/usr/local/lib/perl/5.10.1 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.10.1 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.4.5', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /lib64 /usr/lib64
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.11.2.so, so=so, useshrplib=true, libperl=libperl.so.5.10.1
    gnulibc_version='2.11.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib -fstack-protector'


Characteristics of this binary (from libperl): 
  Compile-time options: MULTIPLICITY PERL_DONT_CREATE_GVSV
                        PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP USE_64_BIT_ALL
                        USE_64_BIT_INT USE_ITHREADS USE_LARGE_FILES
                        USE_PERLIO USE_REENTRANT_API
  Locally applied patches:
    DEBPKG:debian/arm_thread_stress_timeout - http://bugs.debian.org/501970 Raise the timeout of ext/threads/shared/t/stress.t to accommodate slower build hosts
    DEBPKG:debian/cpan_config_path - Set location of CPAN::Config to /etc/perl as /usr may not be writable.

    <snip-- iow patches galore --you get the picture>

    DEBPKG:fixes/safe-reval-rdo-cve-2010-1447 - [PATCH] Wrap by default coderefs returned by rdo and reval
    DEBPKG:patchlevel - http://bugs.debian.org/567489 List packaged patches for 5.10.1-17squeeze2 in patchlevel.h
  Built under linux
  Compiled at Jun 30 2011 22:28:00
  @INC:
    /etc/perl
    /usr/local/lib/perl/5.10.1
    /usr/local/share/perl/5.10.1
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.10
    /usr/share/perl/5.10
    /usr/local/lib/site_perl
    /usr/local/lib/perl/5.10.0
    /usr/local/share/perl/5.10.0
    .

Not to be confused with the much less informative perl -v:

% perl -v
This is perl, v5.10.1 (*) built for x86_64-linux-gnu-thread-multi
(with 53 registered patches, see perl -V for more detail)

Copyright 1987-2009, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl".  If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.

python -c 'import sysconfig, pprint; pprint.pprint(sysconfig.get_config_vars())'
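The full dictionary is long, so it can help to pull out a short summary first. The sketch below collects a few of the items that perl -V reports (version, platform, compiler, module search path); the choice of fields and the name build_summary are my own. Running python -m sysconfig gives a fuller report.

```python
import platform
import pprint
import sys
import sysconfig

def build_summary():
    """Collect a small, perl -V style overview of the running interpreter."""
    return {
        "version": sys.version,
        "implementation": platform.python_implementation(),
        "platform": sysconfig.get_platform(),
        "prefix": sys.prefix,
        "compiler": sysconfig.get_config_var("CC"),   # may be None on some builds
        "module search path": list(sys.path),         # rough analogue of @INC
    }

pprint.pprint(build_summary())
```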

Strange "half to even" rounding in different languages

11 votes

GNU bash, version 4.2.24:

$> printf "%.0f, %.0f\n" 48.5 49.5
48, 50

Ruby 1.8.7

> printf( "%.0f, %.0f\n", 48.5, 49.5 )
48, 50

Perl 5.12.4

$> perl -e 'printf( "%.0f, %.0f\n", 48.5, 49.5 )'
48, 50

gcc 4.5.3:

> printf( "%.0f, %.0f\n", 48.5, 49.5 );
48, 50

GHC, version 7.0.4:

> printf "%.0f, %.0f\n" 48.5 49.5
49, 50

Wikipedia says that this kind of rounding is called round half to even:

This is the default rounding mode used in IEEE 754 computing functions and operators.

Why is this rounding used by default in C, Perl, Ruby and bash, but not in Haskell?

Is it some sort of tradition or standard? And if it is a standard, why is it used by those languages and not by Haskell? What is the point of rounding half to even?

GHCi> round 48.5
48
GHCi> round 49.5
50

The only difference is that printf isn't using round, presumably because it has to be able to round to more than just whole integers. I don't think IEEE 754 specifies anything about how to implement printf-style formatting functions, just rounding, which Haskell does correctly.

It would probably be best if printf was consistent with round and other languages' implementations, but I don't think it's really a big deal.
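As for the point of rounding half to even: it avoids the upward drift that "always round halves up" introduces when many values are summed. A small Python 3 sketch (my own illustration, not from the answer) shows the bias on four halves, and that Python's formatting and round follow the same half-to-even rule as the C, Perl, Ruby and bash examples.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

halves = ["0.5", "1.5", "2.5", "3.5"]

def rounded_sum(mode):
    """Round each value to an integer with the given mode, then add them up."""
    return sum(Decimal(h).quantize(Decimal("1"), rounding=mode) for h in halves)

print(sum(Decimal(h) for h in halves))   # 8.0, the exact total
print(rounded_sum(ROUND_HALF_UP))        # 10, biased upwards
print(rounded_sum(ROUND_HALF_EVEN))      # 8, errors cancel out

print("%.0f, %.0f" % (48.5, 49.5))       # 48, 50
print(round(48.5), round(49.5))          # 48 50
```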

Searching and marking paired patterns on a line

8 votes

I need to search for and mark patterns which are split somewhere on a line. Here is a shortened list of sample patterns which are placed in a separate file, e.g.:

CAT,TREE
LION,FOREST
OWL,WATERFALL

A match appears if the item from column 2 ever appears after and on the same line as the item from column 1. E.g.:

THEREISACATINTHETREE. (matches)

No match appears if the item from column 2 appears first on the line, e.g.:

THETREEHASACAT. (does not match)

Furthermore, no match appears if the item from column 1 and 2 touch, e.g.:

THECATTREEHASMANYBIRDS. (does not match)

Once any match is found, I need to mark it with \start{n} (appearing after the column 1 item) and \end{n} (appearing before the column 2 item), where n is a simple counter which increases anytime any match is found. E.g.:

THEREISACAT\start{1}INTHE\end{1}TREE.

Here is a more complex example:

THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.

This becomes:

THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}IN\end{1}TREENEARTHE\end{3}WATERFALL.

Sometimes there are multiple matches in the same place:

 THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.

This becomes:

 THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.

  • There are no spaces in the file.
  • Many non-Latin characters appear in the file.
  • Pattern matches need only be found on the same line (e.g. "CAT" on line 1 does not ever match with a "TREE" found on line 2, as those are on different lines).

How can I find these matches and mark them in this way?

Check this out (Ruby):

#!/usr/bin/env ruby
patterns = [
  ['CAT', 'TREE'],
  ['LION', 'FOREST'],
  ['OWL', 'WATERFALL']
]

lines = [
  'THEREISACATINTHETREE.',
  'THETREEHASACAT.',
  'THECATTREEHASMANYBIRDS.',
  'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.',
  'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.',
  'CAT...TREE...CAT...TREE'
]

lines.each do |line|
  puts line
  matches = Hash.new{|h,e| h[e] = [] }
  match_indices = []
  patterns.each do |first,second|
    offset = 0
    while new_offset = line.index(first,offset) do
      # map second element of the pattern to minimal position it might be matched
      matches[second] << new_offset + first.size + 1
      offset = new_offset + 1
    end
  end
  global_counter = 1
  matches.each do |second,offsets|
    offsets.each do |offset|
      second_offset = offset
      while new_offset = line.index(second,second_offset) do
        # register the end index of the first pattern and 
        # the start index of the second pattern with the global match count
        match_indices << [offset-1,new_offset,global_counter]
        second_offset = new_offset + 1
        global_counter += 1
      end
    end
  end
  indices = Hash.new{|h,e| h[e] = ""}
  match_indices.each do |first,second,global_counter|
    # build the insertion string for the string positions the 
    # start and end tags should be placed in
    indices[first] << "\\start{#{global_counter}}"
    indices[second] << "\\end{#{global_counter}}"
  end
  inserted_length = 0
  indices.sort_by{|k,v| k}.each do |position,insert|
    # insert the tags at their positions
    line.insert(position + inserted_length,insert)
    inserted_length += insert.size
  end
  puts line
end

Result

THEREISACATINTHETREE.
THEREISACAT\start{1}INTHE\end{1}TREE.
THETREEHASACAT.
THETREEHASACAT.
THECATTREEHASMANYBIRDS.
THECATTREEHASMANYBIRDS.
THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.
THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}IN\end{1}TREENEARTHE\end{3}WATERFALL.
THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.
CAT...TREE...CAT...TREE
CAT\start{1}\start{2}...\end{1}TREE...CAT\start{3}...\end{2}\end{3}TREE

EDIT

I inserted some comments and clarified some of the variables.

Read unbuffered data from pipe in Perl

7 votes

I am trying to read unbuffered data from a pipe in Perl. For example in the program below:

open FILE,"-|","iostat -dx 10 5";
$old=select FILE;
$|=1;
select $old;
$|=1;

foreach $i (<FILE>) {
  print "GOT: $i\n";
}

iostat spits out data every 10 seconds (five times). You would expect this program to do the same. However, instead it appears to hang for 50 seconds (i.e. 10x5), after which it spits out all the data.

How can I get the read to return whatever data is available (in an unbuffered manner), without waiting all the way for EOF?

P.S. I have seen numerous references to this under Windows - I am doing this under Linux.

#!/usr/bin/env perl

use strict;
use warnings;



open(PIPE, "iostat -dx 10 1 |")       || die "couldn't start pipe: $!";

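# Read in scalar context: while fetches one line per iteration as it arrives,
# whereas foreach (<PIPE>) builds the full list first and so waits for EOF.
while (my $line = <PIPE>) {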
    print "Got line number $. from pipe: $line";
}

close(PIPE)                           || die "couldn't close pipe: $! $?";
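The same lazy-versus-slurp distinction exists in Python 3: iterating over the pipe yields lines as they arrive, while calling readlines() on it would wait for EOF just like foreach (<FILE>) does. In the sketch below the child process is a hypothetical stand-in for iostat that prints a line, flushes and sleeps; stream_lines is an invented name.

```python
import subprocess
import sys

def stream_lines(cmd):
    """Yield each output line of cmd as soon as it arrives on the pipe."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:        # lazy: one line per iteration
            yield line.rstrip("\n")

# Stand-in for "iostat -dx 10 5": three lines, 0.1 seconds apart.
child = [sys.executable, "-c",
         "import time\n"
         "for i in range(3):\n"
         "    print('tick', i, flush=True)\n"
         "    time.sleep(0.1)\n"]

for line in stream_lines(child):
    print("GOT:", line)
```

Note that this only removes buffering on the reading side; a child that buffers its own output when writing to a pipe would still deliver it late.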

Perl pattern matching when using arrays

7 votes

I have a strange problem in matching a pattern.

Consider the Perl code below

#!/usr/bin/perl -w

use strict;
my @Array = ("Hello|World","Good|Day");

function();
function();
function();

sub function 
{
  foreach my $pattern (@Array)  
  {
    $pattern =~ /(\w+)\|(\w+)/g;
    print $1."\n";
  }
    print "\n";
}

__END__

The output I expect should be


Hello
Good

Hello
Good

Hello
Good

But what I get is

Hello
Good

Use of uninitialized value $1 in concatenation (.) or string at D:\perlfiles\problem.pl line 28.
Use of uninitialized value $1 in concatenation (.) or string at D:\perlfiles\problem.pl line 28.

Hello
Good

What I observed is that the pattern matches only on alternate calls.
Can someone explain what the problem with this code is?
To fix this I changed the function subroutine to something like this:

sub function 
{
    my $string;
    foreach my $pattern (@Array)
    {
        $string .= $pattern."\n";
    }
    while ($string =~ m/(\w+)\|(\w+)/g)
    {
            print $1."\n";
    }
    print "\n";
}

Now I get the output as expected.

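It is the global /g modifier that is at work. In scalar context, a /g match remembers where the last match on that string ended (see pos) and resumes from there on the next attempt. Because $pattern is an alias for the array element, that position survives between calls to function(): the second call starts at the end of each string, fails, and resets the position, so the third call matches again.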

Remove the /g modifier, and it will act as you expect.
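Python's re module keeps no such hidden position, so the behaviour has to be modelled by hand to see it. The class below is a toy model of scalar-context m//g (one saved position per string, reset on failure); it is an illustration of the mechanism, not how Perl is implemented, and the name GMatcher is made up.

```python
import re

class GMatcher:
    """Toy model of Perl's scalar-context m//g with a saved position per string."""

    def __init__(self, pattern):
        self.regex = re.compile(pattern)
        self.pos = {}                       # string index -> saved position

    def match(self, key, text):
        found = self.regex.search(text, self.pos.get(key, 0))
        if found:
            self.pos[key] = found.end()     # next attempt resumes here
        else:
            self.pos.pop(key, None)         # a failed match resets the position
        return found.group(1) if found else None

strings = ["Hello|World", "Good|Day"]
matcher = GMatcher(r"(\w+)\|(\w+)")

calls = []
for _ in range(3):                          # the three calls to function()
    calls.append([matcher.match(i, s) for i, s in enumerate(strings)])
print(calls)
# [['Hello', 'Good'], [None, None], ['Hello', 'Good']]
```

The middle pair of None values corresponds to the two "uninitialized value $1" warnings in the question.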

Can BerkeleyDB in perl handle a hash of hashes of hashes (up to n)?

6 votes

I have a script that utilizes a hash, which contains four strings as keys whose values are hashes. These hashes also contain four strings as keys which also have hashes as their values. This pattern continues up to n-1 levels, which is determined at run-time. The nth-level of hashes contain integer (as opposed to the usual hash-reference) values.

I installed the BerkeleyDB module for Perl so I can use disk space instead of RAM to store this hash. I assumed that I could simply tie the hash to a database, and it would work, so I added the following to my code:

my %tags = () ; 
my $file = "db_tags.db" ; 
unlink $file; 


tie %tags, "BerkeleyDB::Hash", 
        -Filename => $file, 
        -Flags => DB_CREATE
     or die "Cannot open $file\n" ;

However, I get the error:

Can't use string ("HASH(0x1a69ad8)") as a HASH ref while "strict refs" in use at getUniqSubTreeBDB.pl line 31, line 1.

To test, I created a new script with the code (above) that tied the hash to a file. Then I added the following:

my $href = \%tags; 
$tags{'C'} = {} ;

And it ran fine. Then I added:

$tags{'C'}->{'G'} = {} ;

And it would give pretty much the same error. I am thinking that BerkeleyDB cannot handle the type of data structure I am creating. Maybe it was able to handle the first level (C->{}) in my test because it was just a regular key -> scalar?

Anyways, any suggestions or affirmations of my hypothesis would be appreciated.

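Use DBM::Deep. A hash tied to BerkeleyDB::Hash stores only flat strings, so a hash reference assigned to it is saved as its string form (such as "HASH(0x1a69ad8)") and can no longer be dereferenced. DBM::Deep stores nested structures directly.

use DBM::Deep;

my $db = DBM::Deep->new( "foo.db" );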

$db->{mykey} = "myvalue";
$db->{myhash} = {};
$db->{myhash}->{subkey} = "subvalue";

print $db->{myhash}->{subkey} . "\n";

The code I provided yesterday would work fine with this.

sub get_node {
   my $p = \shift;
   $p = \( ($$p)->{$_} ) for @_;
   return $p;
}

my @seqs = qw( CG CA TT CG );

my $tree = DBM::Deep->new("foo.db");
++${ get_node($tree, split //) } for @seqs;
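For readers who do not read Perl references fluently, here is what get_node and the counting loop do, sketched in Python 3 with plain in-memory dicts standing in for the DBM::Deep object. Disk persistence is deliberately left out, and count_sequences is a name I introduced.

```python
def get_node(tree, path):
    """Walk path through nested dicts, creating missing levels; return the last dict."""
    node = tree
    for key in path:
        node = node.setdefault(key, {})
    return node

def count_sequences(seqs):
    """Build a tree of characters whose leaves count how often each sequence occurs."""
    tree = {}
    for seq in seqs:
        *prefix, last = seq                 # "CG" -> prefix ['C'], last 'G'
        parent = get_node(tree, prefix)
        parent[last] = parent.get(last, 0) + 1
    return tree

print(count_sequences(["CG", "CA", "TT", "CG"]))
# {'C': {'G': 2, 'A': 1}, 'T': {'T': 1}}
```

The Perl version returns a reference to the leaf slot itself so that ++ can autovivify it; the Python sketch returns the parent dict instead because Python has no references to dict slots.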