Best c questions in February 2012

How are gcc/g++ bootstrapped?

85 votes

This has been bugging me for a while. How do gcc/g++ compile themselves? I'm guessing that every revision gets compiled with a previously built revision. Is this true? And if it is, does it mean that the oldest g++/gcc versions were written in assembly?

The oldest version of GCC was compiled using another C compiler, since there were others when it was written. The very first C compiler ever (ca. 1973, IIRC) was implemented in PDP-11 assembly. Similarly, the first ever C++ compiler (CPre/Cfront, 1979-1983) were probably first implemented in C, then rewritten in C++.

When you compile GCC or any other self-hosting compiler, the full order of building is:

  1. Build new version of GCC with existing C compiler
  2. re-build new version of GCC with the one you just built
  3. (optional) repeat step 2 for verification purposes.

This process is called bootstrapping. It tests the compiler's capability of compiling itself and makes sure that the resulting compiler is built with all the optimizations that it itself implements.

EDIT: Drew Dormann, in the comments, points to Bjarne Stroustrup's account of the earliest implementation of C++. It was implemented in C++ but translated by Stroustrup calls a "preprocessor" from C++ to C; not a full compiler by his definition, but still C++ was bootstrapped in C.

How to get 100% CPU usage from a C program

44 votes

This is quite an interesting question so let me set the scene. I work at The National Museum of Computing, and we have just managed to get a Cray Y-MP EL super computer from 1992 running, and we really want to see how fast it can go!

We decided the best way to do this was to write a simple C program that would calculate prime numbers and show how long it took to do so, then run the program on a fast modern desktop PC and compare the results.

We quickly came up with this code to count prime numbers:

#include <stdio.h>
#include <time.h>

void main() {
    clock_t start, end;
    double runTime;
    start = clock();
    int i, num = 1, primes = 0;

    while (num <= 1000) { 
        i = 2; 
        while (i <= num) { 
            if(num % i == 0)
                break;
            i++; 
        }
        if (i == num)
            primes++;

        system("clear");
        printf("%d prime numbers calculated\n",primes);
        num++;
    }

    end = clock();
    runTime = (end - start) / (double) CLOCKS_PER_SEC;
    printf("This machine calculated all %d prime numbers under 1000 in %g seconds\n", primes, runTime);
}

Which on our dual core laptop running Ubuntu (The Cray runs UNICOS), worked perfectly, getting 100% CPU usage and taking about 10 minutes or so. When I got home I decided to try it on my hex-core modern gaming PC, and this is where we get our first issues.

I first adapted the code to run on Windows since that is what the gaming PC was using, but was saddened to find that the process was only getting about 15% of the CPU's power. I figured that must be Windows being Windows, so I booted into a Live CD of Ubuntu thinking that Ubuntu would allow the process to run with its full potential as it had done earlier on my laptop.

However I only got 5% usage! So my question is, how can I adapt the program to run on my gaming machine in either Windows 7 or live Linux at 100% CPU utilisation? Another thing that would be great but not necessary is if the end product can be one .exe that could be easily distributed and ran on Windows machines.

Thanks a lot!

P.S. Of course this program didn't really work with the Crays 8 specialist processors, and that is a whole other issue... If you know anything about optimising code to work on 90's Cray super computers give us a shout too!

If you want 100% CPU, you need to use more than 1 core. To do that, you need multiple threads.

Here's a parallel version using OpenMP:

I had to increase the limit to 1000000 to make it take more than 1 second on my machine.

#include <stdio.h>
#include <time.h>
#include <omp.h>

int main() {
    double start, end;
    double runTime;
    start = omp_get_wtime();
    int num = 1,primes = 0;

    int limit = 1000000;

#pragma omp parallel for schedule(dynamic) reduction(+ : primes)
    for (num = 1; num <= limit; num++) { 
        int i = 2; 
        while(i <= num) { 
            if(num % i == 0)
                break;
            i++; 
        }
        if(i == num)
            primes++;
//      printf("%d prime numbers calculated\n",primes);
    }

    end = omp_get_wtime();
    runTime = end - start;
    printf("This machine calculated all %d prime numbers under %d in %g seconds\n",primes,limit,runTime);

    return 0;
}

Output:

This machine calculated all 78498 prime numbers under 1000000 in 29.753 seconds

Here's your 100% CPU:

enter image description here

Strange C/C++ syntax

19 votes

Possible Duplicate:
What's this C++ syntax that puts a brace-surrounded block where an expression is expected?

I've just come across this strange C/C++ syntax:

#include <stdio.h>
int main() {
    printf("%s",
        ({
        static char b__[129];
        b__[0] = 55;
        b__[1] = 55;
        b__[2] = 0;
        b__;
        })
    );
}

This compiles and runs fine using both gcc and g++ (4.5.2). This is the first time I see something like this, and I wonder what exactly this syntax means. I've tried to Google it, but I have no idea what this construct is called.

They're called statement expressions, it's a GNU extension. In your example the result of the expression is b__.

About Pointers To Functions

18 votes
#include<stdio.h>
#include<stdlib.h>

int fun1()
{
    printf("I am fun1.");
    return 0;
}

int fun2(int fun())
{
    fun();
    return 0;
}

int main()
{
    fun2(fun1);
    return 0;
}

The above program can run. As far as I am concerned, I can understand int fun2(int (*fun)()), but I do not know how int fun2(int fun()) works. Thank you.

When you write int fun2(int fun()), the parameter int fun() converts into int (*fun)(), it becomes exactly equivalent to this:

int fun2(int (*fun)());

A more famiiar conversion happens in case of array when you declare it as function parameter. For example, if you've this:

int f(int a[100]);

Even here the parameter type converts into int*, and it becomes this:

int f(int *a);

The reason why function type and array type converts into function pointer type, and pointer type, respectively, is because the Standard doesn't allow function and array to be passed to a function, neither can you return function and array from a function. In both cases, they decay into their pointer version.

The C++03 Standard says in §13.1/3 (and it is same in C++11 also),

Parameter declarations that differ only in that one is a function type and the other is a pointer to the same function type are equivalent. That is, the function type is adjusted to become a pointer to function type (8.3.5).

And a more interesting discussion is here:

Does the current C standard prohibit short-circuiting `&` and `|`?

16 votes

Is there anything in the C standard (I guess at the moment that's C99 + TC1-3 C11) that guarantees that & and | will not be short-circuited?

If I write:

x = y & foo();

...I expect foo will always get called, but is that really defined? In theory, barring the standard saying otherwise, if y contained 0, a runtime optimization could skip the call in the absense of something saying that's not allowed. (And similarly with |, you could ignore the right-hand operand if the left-hand operand were already all-bits-on. For that matter, even x = y * foo(); could be short-circuited if y were 0.)

Not knowing the specification well (and I don't), it's tricky to prove a negative like that. I can contrast the sections on & (6.5.10 in C99) and && (6.5.13 in C99). In the latter, it's perfectly clear:

Unlike the bitwise binary & operator, the && operator guarantees left-to-right evaluation; there is a sequence point after the evaluation of the first operand. If the first operand compares equal to 0, the second operand is not evaluated.

...but 6.5.10 doesn't specifically state the negative version of that.

It seems reasonable to me to take the fact that 6.5.10 doesn't define a sequence point to mean that foo will always get called and an implementation that didn't call it would be non-standard. Am I right about that?

It seems reasonable to me to take the fact that 6.5.10 doesn't define a sequence point to mean that foo will always get called and an implementation that didn't call it would be non-standard. Am I right about that?

Yes and no. Indeed, the implementation that wouldn't call foo would be nonstandard. However, it doesn't have anything to do with sequence points.

The paragraph that would apply here would be 5.1.2.3/3:

In the abstract machine, all expressions are evaluated as specified by the semantics. An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or accessing a volatile object).

Why is the executable so big? (Why isn't dead code removed?)

16 votes

Compilng and linking this file results in a 1-KiB executable:

#pragma comment(linker, "/Entry:mainCRTStartup") // No CRT code (reduce size)
#pragma comment(linker, "/Subsystem:Console")    // Needed if avoiding CRT

#define STRINGIFIER(x)    func##x
#define STRINGIFY(x)      STRINGIFIER(x)
#define G   int STRINGIFY(__COUNTER__)(void) { return __COUNTER__; }

int mainCRTStartup(void) { return 0; }  // Does nothing

#if 0
    // Every `G' generates a new, unused function
    G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G
    G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G
#endif

When you change #if 0 to #if 1), the output size doubles to 2 KiB.

It seems to do this with all versions of Visual C++ to date, even though my command-line options contain all optimizations I could think of:

/Ox /MD /link /fixed /OPT:ICF /OPT:REF

and, specifically, I did not include any debugging information.

Does anyone know why /OPT:REF is not causing the linker to remove the unused functions?

In broad terms... the compiler generates code in "object records" that contains a bunch of assembly code and supporting information. The linker links these object records together to create an executable.

Often a compiler will create a single object record for an entire source file. In this case, the linker can only decide to link in the entire object record, or not. Since there is at least one function in the object record that is used, it must link in all of it.

On some compilers, you can tell it to generate a separate object record for each function (an object file can have multiple object records). In this case, the linker can make the decision to omit some of the object records if they're never called.

From the Microsoft documentation for /OPT:

/OPT:REF

LINK removes unreferenced packaged functions by default. An object contains packaged functions (COMDATs) if it has been compiled with the /Gy option. This optimization is called transitive COMDAT elimination. To override this default and keep unreferenced COMDATs in the program, specify /OPT:NOREF. You can use the /INCLUDE option to override the removal of a specific symbol.

The /Gy compiler option enables function-level linking.

For reference, this feature also exists in gcc:

-ffunction-sections
-fdata-sections

Place each function or data item into its own section in the output file if the target supports arbitrary sections. The name of the function or the name of the data item determines the section’s name in the output file.

Use these options on systems where the linker can perform optimizations to improve locality of reference in the instruction space. Most systems using the ELF object format and SPARC processors running Solaris 2 have linkers with such optimizations. AIX may have these optimizations in the future.

Only use these options when there are significant benefits from doing so. When you specify these options, the assembler and linker will create larger object and executable files and will also be slower. You will not be able to use "gprof" on all systems if you specify this option and you may have problems with debugging if you specify both this option and -g.

And the companion option in ld:

--gc-sections

Enable garbage collection of unused input sections. It is ignored on targets that do not support this option. This option is not compatible with -r or --emit-relocs. The default behaviour (of not performing this garbage collection) can be restored by specifying --no-gc-sections on the command line.

Replacing extrordinarily slow pow() function

15 votes

We have a CFD solver and while running a simulation, it was found to run extraordinarily slow on some machines but not others. Using Intel VTune, it was found the following line was the problem (in Fortran):

RHOV= RHO_INF*((1.0_wp - COEFF*EXP(F0)))**(1.0_wp/(GAMM - 1.0_wp))

Drilling in with VTune, the problem was traced to the call pow assembly line and when tracing the stack, it showed it was using __slowpow(). After some searching, this page showed up complaining about the same thing.

On the machine with libc version 2.12, the simulation took 18 seconds. On the machine with libc version 2.14, the simulation took 0 seconds.

Based on the information on the aforementioned page, the problem arises when the base to pow() is close to 1.0. So we did another simple test where we scaled the base by an arbitrary number before the pow() and then divided by the number raised to the exponent after the pow() call. This dropped the runtime from 18 seconds to 0 seconds with the libc 2.12 also.

However, it's impractical to put this all over the code where we do a**b. How would one go about replacing the pow() function in libc? For instance, I would like the assembly line call pow generated by the Fortran compiler to call a custom pow() function we write that does the scaling, calls the libc pow() and then divides by the scaling. How does one create an intermediate layer transparent to the compiler?

Edit

To clarify, we're looking for something like (pseudo-code):

double pow(a,b) {
   a *= 5.0
   tmp = pow_from_libc(a,b)
   return tmp/pow_from_libc(5.0, b)
}

Is it possible to load the pow from libc and rename it in our custom function to avoid the naming conflicts? If the customPow.o file could rename pow from libc, what happens if libc is still needed for other things? Would that cause a naming conflict between pow in customPow.o and pow in libc?

Just write your own pow function, put the .o file in a static library archive libmypow.a somewhere in the linker's library path, and pass -lmypow when linking.

Bit-fields and sequence points

14 votes

For an implementation that packs f0 and f1 into the same byte, is the program below defined?

struct S0 {
       unsigned f0:4;
       signed f1:4;
} l_62;

int main (void) {
       (l_62.f0 = 0) + (l_62.f1 = 0);
       return 0;
}

I am interested in the answer for C99 and for C11 if there is reason to think that it is different there.

In C99, all I found was 6.5.16:4:

[...] If an attempt is made to modify the result of an assignment operator or to access it after the next sequence point, the behavior is undefined.

It is not clear for me what consequences this paragraph has on the program above.

Based on a large number of randomized tests, most compilers appear to generate code where the two assignments do not interfere.

EDIT: the C99 quote above may be the wrong one. I think I meant to quote 6.5:2 instead.

Between the previous and next sequence point an object shall have its stored value modified at most once by the evaluation of an expression. [...]

C11 considers adjacent named bit fields to be part of the same memory location. Such bit fields are not guaranteed to be updated atomically, in other words if one update is not sequenced explicitly before the other the behavior is undefined. 3.14 memory location then also has a detailed explanation of when two fields can be considered being in different memory locations, thus updates to them can be considered independently.

If you would modify your structure

struct S0 {
       unsigned f0:4;
       int :0;
       signed f1:4;
} l_62;

such that there is this bizarre "memory location separator" between the two bit fields, your code would be guaranteed to be fine.

For C99 the case seems to be more complicated, there is not such a detailed concept of memory location. In a recent discussion on the linux kernel mailing list there was a claim that generally for all pairs of bit fields there would be a guarantee of atomicity when updating any of them. The starting point of that discussion was a case where gcc polluted a non-bit field neighboring a bit field in an unexpected way leading to spurious crashes.

Is there a more efficient way of splitting a number into its digits?

14 votes

I have to split a number into its digits in order to display it on an LCD. Right now I use the following method:

pos = 7;

do
{
    LCD_Display(pos, val % 10);
    val /= 10;
    pos--;
} while (pos >= 0 && val);

The problem with this method is that division and modulo operations are extremely slow on an MSP430 microcontroller. Is there any alternative to this method, something that either does not involve division or that reduces the number of operations?

A note: I can't use any library functions, such as itoa. The libraries are big and the functions themselves are rather resource hungry (both in terms of number of cycles, and RAM usage).

You could do subtractions in a loop with predefined base 10 values.

My C is a bit rusty, but something like this:

int num[] = { 10000000,1000000,100000,10000,1000,100,10,1 };

for (pos = 0; pos < 8; pos++) {
  int cnt = 0;
  while (val >= num[pos]) {
    cnt++;
    val -= num[pos];
  }
  LCD_Display(pos, cnt);
}

g++ optimization options affect the value of sin function

12 votes

I have a problem with "sin" function of libc.

#include <cmath>
#include <stdio.h>

int main(int argc, char **argv)
{
    double tt = 6.28318530717958620000; // 2 * M_PI
    double yy = ::sin(tt);

    printf("%.32f\n", yy);

    return 0;
}

When compile the above code using "g++" without any optimization option, it would output "-0.00000000000000024492127076447545". But if with "-O3" option, it would output "-0.00000000000000024492935982947064".

Why doesn't it return "-0.00000000000000024492935982947064" without "-O3"? Thanks in advance.

Because with "-O3" the compiler precomputes sin(2*pi) at compile time, with one algorithm. Without "-O3" this is computed at runtime, with other algorithm.

This may be because compiler itself was built with some math library, which differ from your math library.

Update

The only entity, giving the result "-0.00000000000000024492127076447545" is 32-bit version of libstdc++. 64-bit version of the same library as well as gcc itself produce "-0.00000000000000024492935982947064".

So upgrading to newer version will not help. Also I tried various options, proposed here: neither -ffloat-store, nor -fno-builtin do not make any difference, as well as long double and sinl.

32-bit libstdc++ uses 387 floating point instructions, while gcc apparently uses SSE instructions. Here is the difference. Probably, the only way to make them consistent is to rebuild gcc from sources, directing it to use only 387 instructions internally.

I don't understand this example of fork()

12 votes

I have this example of code, but I don't understand why this code creates 5 processes plus the original. (6 process total)

#include <unistd.h>

int main(void) {
    int i;
    for (i = 0; i < 3; i++) {
        if (fork() && (i == 1)) {
            break;
        }
    }
}

Process graph

fork() splits a process in two, and returns either 0 (if this process is the child), or the PID of the child (if this process is the parent). So, this line:

if (fork() && (i == 1)) break;

Says "if this is the parent process, and this is the second time through the loop, break out of the loop". This means the loop runs like this:

  • i == 0: The first time through the loop, i is 0, we create two processes, both entering the loop at i == 1. Total now two processes

  • i == 1: Both of those processes fork, but two of them do not continue to iterate because of the if (fork() && (i == 1)) break; line (the two that don't continue are both of the parents in the fork calls). Total now four processes, but only two of those are continuing to loop.

  • i == 2: Now, the two that continue the loop both fork, resulting in 6 processes.

  • i == 3: All 6 processes exit the loop (since i < 3 == false , there is no more looping)

GCC return address of calling function in ARM architecture

11 votes

I'm curious why __builtin_return_address() doesn't supports other arguments than 0 in ARM ? It's a problem that somehow you can't deduce calling function address from the stack of ARM ? Or something else ?

Thanks

According to this post <http://codingrelic.geekhold.com/2009/05/pre-mortem-backtracing.html>,

Also on some architectures, including my beloved MIPS, only __builtin_return_address(0) works. MIPS has no frame pointer, making it difficult to walk back up the stack. Frame 0 can use the return address register directly. If ARM also does not have a frame pointer, this would explain the limitation.

See also http://gcc.gnu.org/onlinedocs/gcc/Return-Address.html.

How do I read this complex declaration in C

11 votes

Possible Duplicate:
what's the meaning of this piece of code? void (*signal(int sig, void (*func)(int)))(int);

I have complex declaration which have taken from "signal.h" header file ,below is the declaration

  void (*signal(int sig, void (*func)(int)))(int); 

Now How I parse it as

signal is function taking two arguments ‘sig’ of int type and ‘func’, which is a pointer to a function taking int as an argument and returns void type; it returns a pointer to the function taking int as argument and returning void.

Is it ok or signal is a pointer to function?

Start with the leftmost identifier and work your way out, remembering that [] and () bind before *, so *a[] is an array of pointers, (*a)[] is a pointer to an array, *f() is a function returning a pointer, and (*f)() is a pointer to a function:

       signal                                     -- signal
       signal(                          )         -- is a function
       signal(    sig,                  )         -- with a parameter named sig
       signal(int sig,                  )         --   of type int
       signal(int sig,        func      )         -- and a parameter named func
       signal(int sig,      (*func)     )         --   which is a pointer
       signal(int sig,      (*func)(   ))         --   to a function
       signal(int sig,      (*func)(int))         --     taking an int parameter
       signal(int sig, void (*func)(int))         --     and returning void
      *signal(int sig, void (*func)(int))         -- returning a pointer
     (*signal(int sig, void (*func)(int)))(   )   -- to a function
     (*signal(int sig, void (*func)(int)))(int)   --   taking an int parameter
void (*signal(int sig, void (*func)(int)))(int);  --   and returning void

signal associates a signal handler function func with a signal sig, and returns the pointer to the old signal handler function:

void new_interrupt_handler(int sig)
{
  ... // do something interesting with interrupt signal
}

int main(void)
{
  void (*old_interrupt_handler)(int);
  ...
  /**
   * Set up our new interrupt handler
   */
  old_interrupt_handler = signal(SIGINT, new_interrupt_handler);
  ...
  /**
   * Restore original interrupt handler
   */
  signal(SIGINT, old_interrupt_handler);
  ...
}

How to use external memory on a microcontroller

10 votes

In the past, I've worked a lot with 8 bit AVR's and MSP430's where both the RAM and flash were stored on the chip directly. When you compile and download your program, it sort of "just works" and you don't need to worry about where and how variables are actually stored.

Now I'm starting a project where I'd like to be able to add some external memory to a microcontroller (a TI Stellaris LM3S9D92 if that matters) but I'm not entirely sure how you get your code to use the external RAM. I can see how you configure the external bus pretty much like any other peripheral but what confuses me is how the processor keeps track of when to talk to the external memory and when to talk to the internal one.

From what I can tell, the external RAM is mapped to the same address space as the internal SRAM (internal starts at 0x20000000 and external starts at 0x60000000). Does that mean if I wrote something like this:

int* x= 0x20000000;
int* y= 0x60000000;

Would x and y would point to the first 4 bytes (assuming 32 bit ints) of internal and external RAM respectively? If so, what if I did something like this:

int x[999999999999]; //some super big array that uses all the internal ram
int y[999999999999]; //this would have to be in external ram or it wouldn't fit

I imagine that I'd need to tell something about the boundaries of where each type of memory is or do I have it all wrong and the hardware figures it out on its own? Do linker scripts deal with this? I know they have something to do with memory mapping but I don't know what exactly. After reading about how to set up an ARM cross compiler I get the feeling that something like winavr (avr-gcc) was doing a lot of stuff like this for me behind the scenes so I wouldn't have to deal with it.

Sorry for rambling a bit but I'd really appreciate it if someone could tell me if I'm on the right track with this stuff.

Update

For any future readers I found this after another few hours of googling http://www.bravegnu.org/gnu-eprog/index.html. Combined with answers here it helped me a lot.

Generally that is exactly how it works. You have to properly setup the hardware and/or the hardware may already have things hardcoded at fixed addresses.

You could ask the same question, how does the hardware know that when I write a byte to address 0x21000010 (I just made that up) that that is the uart transmit holding register and that write means I want to send a byte out the uart? The answer because it is hardcoded in the logic that way. Or the logic might have an offset, the uart might be able to move it might be at some other control register contents plus 0x10. change that control register (which itself has some hardcoded address) from 0x21000000, to 0x90000000 and then write to 0x90000010 and another byte goes out the uart.

I would have to look at that particular part, but if it does support external memory, then in theory that is all you have to do know what addresses in the processors address space are mapped to that external memory and reads and writes will cause external memory accesses.

Intel based computers, PC's, tend to like one big flat address space, use the lspci command on your Linux box (if you have one) or some other command if windows or a mac, and you will find that your video card has been given a chunk of address space. If you get through the protection of the cpu/operating system and were to write to an address in that space it will go right out the processor through the pcie controllers and into the video card, either causing havoc or maybe just changing the color of a pixel. You have already dealt with this with your avr and msp430s. Some addresses in the address space are flash, and some are ram, there is some logic outside the cpu core that looks at the cpu cores address bus and makes decisions on where to send that access. So far that flash bank and ram bank and logic are all self contained within the boundaries of the chip, this is not too far of a stretch beyond that the logic responds to an address, and from that creates an external memory cycle, when it is done or the result comes back on a read it completes the internal memory cycle and you go on to the next thing.

Does that make any sense or am I making it worse?

Relearning C: New idioms?

10 votes

I'm relearning C after not having touched it since 2000 or so. I've been working in Ruby since then, and I discovered a whole world of programming idioms I never knew existed.

What important C techniques, books, idioms, etc. have arisen in the past decade, if any? I know about the C99 and C11 standards, but where else should I be looking? Or has C style remained constant even as OOP and FP have become the norm?

C doesn't support at language level nothing more than procedural programming - and that's a precise choice, because it was born to be mostly a "portable assembly" and it's used to work as tightly close to the machine as possible (without resorting to assembly). Most assembly languages do not provide much more, in terms of programming paradigms, than a stack and function call statement (some micros not even that) - and that's what C is modeled upon.

After all, there's a reason why C++ and Objective C were born: C has to keep its design philosophy, and to add more abstract stuff people had to actually fork the language.

That being said, there's nothing stopping you to write e.g. OO code in C - actually, many people do that (I'd say that it's one of the most diffused idioms in C), but you don't have to expect almost any syntax sugar for that: you'll have to use structs for the data, "normal" functions to "emulate" methods, composition for inheritance, pointer tables for polymorphism, and so on. Still, I don't know if this counts as a "last decade" idiom, it is being used since much longer.

Is there any movement towards specifying interaction of C++ exceptions and pthread cancellation?

10 votes

The GNU C library uses DWARF2 unwinding for pthread cancellation these days, so that both C++ exceptions and pthread cancellation cleanup handlers get called through a common call frame unwinding process which invokes destructors for automatic objects as necessary along the way. However, as far as I can tell there is still no standard that specifies the interaction between (POSIX) threads and C++, and presumably an application wishing to be portable should assume that throwing exceptions out of cancellation cleanup contexts is just as undefined as calling longjmp out of them, and that cancelling a thread that has live automatic objects with non-trivial destructors is also undefined behavior.

Is there any standardization process in progress that addresses this interaction, or is it something that can be expected to be undefined well into the future? Does C++11 have any analogous notion to POSIX thread cancellation in its thread support?

As someone who sits on ISO/IEC SC22 which encompasses WG14 (C), WG15 (POSIX) and WG21 (C++), I can tell you that the quick answer is no, C++ exceptions and thread cancellation are not going to see one another any time soon. C11 and C++11 make no mention of thread cancellation, and are highly if not extremely unlikely to recognise it before the next major standards release in about ten years time.

The longer answer comes down to how standards work. Basically ISO can only standardise what everyone can come to agree upon, and people do not agree when it comes to thread cancellation. The whole idea of a thread of execution having to dump state before every cancellable system call goes against the whole ethos of modern software development. It causes immense problems for compiler optimisation because unlike C++ exception throws, a thread cancel is defined to be the same as calling thread_terminate(self) which explicitly precludes doing anything additional (and even cancellation handlers aren't reliably called on many implementations), and I don't think that the thread cancellation supporters would disagree it's a bad solution.

The problem is that the only proper alternative is to reissue the POSIX i/o API with async completion variants. And the problem with that is that different POSIX implementations think of async completion very differently. I mean, we can't even agree on a standard for kernel wait queues, so until that can be achieved an async i/o API is a long way off. I have a proposal to make some movement on kernel wait queues for the next standards TC/TR, but the proposed object is deliberately extremely simplistic.

What we've tried to do in C11/C++11 is for the threading API to always have non-blocking versions - there is only one API in there which can't be done non-blocking which is thread_join() (there is no thread_timedjoin()) and I plan to personally submit an errata on that after I have Austin Working Group approval. In all other cases, one can always construct something which polls which isn't efficient, but is program correct.

In the longer run, personally speaking I see plenty of good reason to add exception handling to C following similar semantics to C++. You wouldn't have object support necessarily (I would actually support adding non-virtual objects to C too personally), but you would have the concept of stack unwound lambda function calls. That would let us formalise hacks like thread cancellation with a properly defined mechanism. It also makes writing fault tolerant C much easier and safer by letting you write the unwind as you write the wind, and lets old C transparently interop with new C.

Regarding throwing exceptions from within exception handling, me personally I think we need to do something better than just always auto invoking terminate(). As unwinding may cause the construction of new objects, or indeed any other source of exception throws, I personally would greatly prefer if every reasonable attempt is made to unwind the whole stack before terminating the process.

So, in short, expect POSIX thread cancellation to continue to be viewed as undefined, and the strong chances are in the long run it'll get deprecated in favour of something better.

BTW, generally POSIX thread cancellation is highly unportable between implementations, so any code which uses POSIX thread cancellation is effectively relying on platform-specific behaviour which is identical to using non-POSIX APIs. If you want your code to be portable, don't use POSIX thread cancellation. Instead use select() or poll() including a magic "please stop thread now" file descriptor. In my own C++ code, I actually have a system API wrapper macro which tests for this magic file descriptor and throws a special C++ exception. This ensures identical behaviour on all platforms, including Windows.

Hope this helps.

Niall

Speed of Python Extensions in C vs. C

9 votes

Python extension modules written in C are faster than the equivalent programs written in pure Python. How do these extension modules compare (speed wise) to programs written in pure C? Are programs written in pure C even faster than the equivalent Python extension module?

How do these extension modules compare (speed wise) to programs written in pure C?

They are slightly slower due to the translation between Python data structures -> C types. Disregarding this translation the actual C code runs at exactly the same speed as a regular C function would.

Are programs written in pure C even faster than the equivalent Python extension module?

C programs (written entirely in C) can be faster than Python programs using the C extension modules. If the C program and the extension module are written with the same level of complexity, coder skill, algorithmic complexity, etc., the C program will win every time. However, if you're not a C guru and you're competing with a highly optimized Python C extension Python could be faster.

Is a Linux executable "compatible" with OS X?

9 votes

If you compile a program in say, C, on a Linux based platform, then port it to use the MacOS libraries, will it work?

Is the core machine-code that comes from a compiler compatible on both Mac and Linux?

The reason I ask this is because both are "UNIX based" so I would think this is true, but I'm not really sure.

Ignore the downvotes and the haters, ghostsoldier23. You're asking a perfectly reasonable question.

However, the answer is no. Linux and Mac OS X binaries are not cross-compatible.

For one thing, Linux executables use a format called ELF.

Mac OS X executables use Mach-O format.

Thus, even if a lot of the libraries ordinarily compile separately on each system, they would not be portable in binary format.

Furthermore, Linux is not actually UNIX-based. It does share a number of common features and tools with UNIX, but a lot of that has to do with computing standards like POSIX.

EDIT:

Finally, to address your point on byte-code: when making a binary, compilers usually generate machine code that is specific to the platform you're developing on. (This isn't always the case, but it usually is.)

How to get instruction information from libopcodes?

7 votes

I am writing a tool which uses libbfd and libopcodes in x86-32 and x86-64 Linux to perform disassembly. The problem is that whilst I am able to get libopcodes to disassemble, I am unable to get any instruction information. For the purposes of demonstration, I have made a minimal example which reproduces my issue. The program should disassemble itself from entry point to the first RET/RETQ.

The code is a bit hacked up with globals and error checking has been omitted for brevity, etc. but should illustrate the issue clearly.

#include <bfd.h>
#include <dis-asm.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <string.h>
#include <ctype.h>
#include <limits.h>
#include <libiberty.h>

/*
 * Holds state for BFD and libopcodes.
 */
bfd *        abfd  = NULL;
disassemble_info dinfo = {0};

/*
 * Temporary hack to signal when disassembling should stop.
 */
static bool stop_disassembling = FALSE;

/*
 * Gets path to currently running executable.
 */
bool get_target_path(char * target_path, size_t size)
{
    char *   path;
    ssize_t len;

    pid_t pid = getpid();
    sprintf(target_path, "/proc/%d/exe", (int)pid );

    path = strdup(target_path);
    len  = readlink(path, target_path, size);

    target_path[len] = '\0';
    free(path);
    return TRUE;
}

/*
 * libopcodes appends spaces on the end of some instructions so for
 * comparisons, we want to strip those first.
 */
void strip_tail(char * str, unsigned int size)
{
    int i;
    for(i = 0; i < size; i++) {
        if(!isgraph(str[i])) {
            str[i] = '\0';
            break;
        }
    }
}

/*
 * Checks whether the current instruction will cause the control flow to not
 * proceed to the linearly subsequent instruction (e.g. ret, jmp, etc.)
 */
bool breaks_control_flow(char * str)
{
    if(abfd->arch_info->bits_per_address == 64) {
        if(strcmp(str, "retq") == 0) {
            return TRUE;
        }
    } else {
        if(strcmp(str, "ret") == 0) {
            return TRUE;
        }
    }

    return FALSE;
}

/*
 * Used as a callback for libopcodes so we can do something useful with the
 * disassembly. Currently this just outputs to stdout.
 */
int custom_fprintf(void * stream, const char * format, ...)
{
    /* silly amount */
    char    str[128] = {0};
    int rv;
    va_list args;

    va_start(args, format);
    rv = vsnprintf(str, ARRAY_SIZE(str) - 1, format, args);
    va_end(args);

    puts(str);
    strip_tail(str, ARRAY_SIZE(str));

    if(breaks_control_flow(str)) {
        puts("Stopped disassembly");
        stop_disassembling = TRUE;
    }

    if(dinfo.insn_info_valid) {
        switch(dinfo.insn_type) {
            case dis_noninsn:
                printf("not an instruction\n");
                break;
            case dis_nonbranch:
                printf("not a branch\n");
                break;
            case dis_branch:
                printf("is a branch\n");
                break;
            case dis_condbranch:
                printf("is a conditional branch\n");
                break;
            case dis_jsr:
                printf("jump to subroutine\n");
                break;
            case dis_condjsr:
                printf("conditional jump to subroutine\n");
                break;
            case dis_dref:
                printf("data reference in instruction\n");
                break;
            case dis_dref2:
                printf("two data references in instruction\n");
                break;
            default:
                printf("not enumerated\n");
                break;
        }
    } else {
        printf("insn_info not valid\n");
    }

    return rv;
}

/*
 * Initialises libopcodes disassembler and returns an instance of it.
 */
disassembler_ftype init_disasm(bfd * abfd, disassemble_info * dinfo)
{
    /* Override the stream the disassembler outputs to */
    init_disassemble_info(dinfo, NULL, custom_fprintf);
    dinfo->flavour = bfd_get_flavour(abfd);
    dinfo->arch    = bfd_get_arch(abfd);
    dinfo->mach    = bfd_get_mach(abfd);
    dinfo->endian  = abfd->xvec->byteorder;
    disassemble_init_for_target(dinfo);

    return disassembler(abfd);
}

/*
 * Method of locating section from VMA taken from opdis.
 */
typedef struct {
    bfd_vma    vma;
    asection * sec;
} BFD_VMA_SECTION;

/*
 * Loads section and fills in dinfo accordingly. Since this function allocates
 * memory in dinfo->buffer, callers need to call free once they are finished.
 */
bool load_section(bfd * abfd, disassemble_info * dinfo, asection * s)
{
    int     size = bfd_section_size(s->owner, s);
    unsigned char * buf  = xmalloc(size);

    if(!bfd_get_section_contents(s->owner, s, buf, 0, size)) {
        free(buf);
        return FALSE;
    }

    dinfo->section       = s;
    dinfo->buffer        = buf;
    dinfo->buffer_length = size;
    dinfo->buffer_vma    = bfd_section_vma(s->owner, s);

    printf("Allocated %d bytes for %s section\n: 0x%lX", size, s->name,
            dinfo->buffer_vma);
    return TRUE;
}

/*
 * Used to locate section for a vma.
 */
void vma_in_section(bfd * abfd, asection * s, void * data)
{
    BFD_VMA_SECTION * req = data;

    if(req && req->vma >= s->vma &&
    req->vma < (s->vma + bfd_section_size(abfd, s)) ) {
        req->sec = s;
    }
}

/*
 * Locate and load section containing vma.
 */
bool load_section_for_vma(bfd * abfd, disassemble_info * dinfo,
        bfd_vma vma)
{
    BFD_VMA_SECTION req = {vma, NULL};
    bfd_map_over_sections(abfd, vma_in_section, &req);

    if(!req.sec) {
        return FALSE;
    } else {
        return load_section(abfd, dinfo, req.sec);
    }
}

/*
 * Start disassembling from entry point.
 */
bool disassemble_entry(bfd * abfd, disassemble_info * dinfo,
        disassembler_ftype disassembler)
{
    bfd_vma    vma = bfd_get_start_address(abfd);

    /* First locate and load the section containing the vma */
    if(load_section_for_vma(abfd, dinfo, vma)) {
        int size;

        /* Keep disassembling until signalled otherwise or error */
        while(true) {
            dinfo->insn_info_valid = 0;
            size = disassembler(vma, dinfo);
            printf("Disassembled %d bytes at 0x%lX\n", size, vma);

            if(size == 0 || size == -1 || stop_disassembling) {
                break;
            }

            vma += size;
        }

        free(dinfo->buffer);
        return TRUE;
    }

    return FALSE;
}

int main(void)
{
    char  target_path[PATH_MAX] = {0};

    /* Get path for the running instance of this program */
    get_target_path(target_path, ARRAY_SIZE(target_path));

    abfd = bfd_openr(target_path, NULL);

    if(abfd != NULL && bfd_check_format(abfd, bfd_object)) {
        disassembler_ftype disassembler = init_disasm(abfd, &dinfo);

        disassemble_entry(abfd, &dinfo, disassembler);

        bfd_close(abfd);
    }

    return EXIT_SUCCESS;
}

This source can be built with the following makefile. To perform a successful link, the binutils-dev package needs to be installed on the local machine:

all:
    gcc -Wall disasm.c -o disasm -lbfd -lopcodes

clean:
    rm -f disasm

When run, the output is this:

Allocated 2216 bytes for .text section
: 0x400BF0xor    
insn_info not valid
%ebp
insn_info not valid
,
insn_info not valid
%ebp
insn_info not valid
Disassembled 2 bytes at 0x400BF0
mov    
insn_info not valid
%rdx
insn_info not valid
,
insn_info not valid
%r9
insn_info not valid
Disassembled 3 bytes at 0x400BF2
pop    
insn_info not valid
%rsi
insn_info not valid
Disassembled 1 bytes at 0x400BF5
mov    
insn_info not valid
%rsp
insn_info not valid
,
insn_info not valid
%rdx
insn_info not valid
Disassembled 3 bytes at 0x400BF6
and    
insn_info not valid
$0xfffffffffffffff0
insn_info not valid
,
insn_info not valid
%rsp
insn_info not valid
Disassembled 4 bytes at 0x400BF9
push   
insn_info not valid
%rax
insn_info not valid
Disassembled 1 bytes at 0x400BFD
push   
insn_info not valid
%rsp
insn_info not valid
Disassembled 1 bytes at 0x400BFE
mov    
insn_info not valid
$0x401450
insn_info not valid
,
insn_info not valid
%r8
insn_info not valid
Disassembled 7 bytes at 0x400BFF
mov    
insn_info not valid
$0x4013c0
insn_info not valid
,
insn_info not valid
%rcx
insn_info not valid
Disassembled 7 bytes at 0x400C06
mov    
insn_info not valid
$0x4012ce
insn_info not valid
,
insn_info not valid
%rdi
insn_info not valid
Disassembled 7 bytes at 0x400C0D
callq  
insn_info not valid
0x0000000000400ad8
insn_info not valid
Disassembled 5 bytes at 0x400C14
hlt    
insn_info not valid
Disassembled 1 bytes at 0x400C19
nop
insn_info not valid
Disassembled 1 bytes at 0x400C1A
nop
insn_info not valid
Disassembled 1 bytes at 0x400C1B
sub    
insn_info not valid
$0x8
insn_info not valid
,
insn_info not valid
%rsp
insn_info not valid
Disassembled 4 bytes at 0x400C1C
mov    
insn_info not valid
0x2013b9(%rip)
insn_info not valid
,
insn_info not valid
%rax
insn_info not valid
        # 
insn_info not valid
0x0000000000601fe0
insn_info not valid
Disassembled 7 bytes at 0x400C20
test   
insn_info not valid
%rax
insn_info not valid
,
insn_info not valid
%rax
insn_info not valid
Disassembled 3 bytes at 0x400C27
je     
insn_info not valid
0x0000000000400c2e
insn_info not valid
Disassembled 2 bytes at 0x400C2A
callq  
insn_info not valid
*%rax
insn_info not valid
Disassembled 2 bytes at 0x400C2C
add    
insn_info not valid
$0x8
insn_info not valid
,
insn_info not valid
%rsp
insn_info not valid
Disassembled 4 bytes at 0x400C2E
retq   
Stopped disassembly
insn_info not valid
Disassembled 1 bytes at 0x400C32

What I am expecting is to be able to read instruction information for each instruction through the dinfo->insn_type, target, etc. The behaviour is exhibited on both x86-32 and x86-64. If I can at least get confirmation that this is unimplemented on these two architectures then I can go about filling in this information myself.

Unfortunately, as of binutils libopcodes 2.22, insn_type is not filled in on either i386 or x86_64. The only widespread supported architectures are MIPS, Sparc, and the Cell’s SPU. This is still true as of current CVS HEAD.

It's hard to prove that something does not exist, but for instance, in the Sparc disassembler source you can see several occurrences of insn_type being set, for instance info->insn_type = dis_branch, whereas in the i386 disassembler source there are no occurrences of insn_type nor any of the values it would be expected to have (dis_branch, dis_nonbranch etc.).

Checking for all the libopcodes files that support insn_type you get:

  • opcodes/mips-dis.c
  • opcodes/spu-dis.c
  • opcodes/microblaze-dis.c
  • opcodes/cris-dis.c
  • opcodes/sparc-dis.c
  • opcodes/mmix-dis.c

Distinction between processes and threads in Linux

6 votes

After reading up on this answer and "Linux Kernel Development" by Robert Love and, subsequently, on the clone() system call, I discovered that processes and threads in Linux are (almost) indistinguishable to the kernel. There are a few tweaks between them (discussed as being "more sharing" or "less sharing" in the quoted SO question), but I do still have some questions yet to be answered.

I recently worked on a program involving a couple of POSIX threads and decided to experiment on this premise. On a process that creates two threads, all threads of course get a unique value returned by pthread_self(), however, not by getpid().

A sample program I created follows:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <pthread.h>

void* threadMethod(void* arg)
{
    int intArg = (int) *((int*) arg);

    int32_t pid = getpid();
    uint64_t pti = pthread_self();

    printf("[Thread %d] getpid() = %d\n", intArg, pid);
    printf("[Thread %d] pthread_self() = %lu\n", intArg, pti);
}

int main()
{
    pthread_t threads[2];

    int thread1 = 1;

    if ((pthread_create(&threads[0], NULL, threadMethod, (void*) &thread1))
         != 0)
    {
        fprintf(stderr, "pthread_create: error\n");
        exit(EXIT_FAILURE);
    }

    int thread2 = 2;

    if ((pthread_create(&threads[1], NULL, threadMethod, (void*) &thread2))
         != 0)
    {
        fprintf(stderr, "pthread_create: error\n");
        exit(EXIT_FAILURE);
    }

    int32_t pid = getpid();
    uint64_t pti = pthread_self();

    printf("[Process] getpid() = %d\n", pid);
    printf("[Process] pthread_self() = %lu\n", pti);

    if ((pthread_join(threads[0], NULL)) != 0)
    {
        fprintf(stderr, "Could not join thread 1\n");
        exit(EXIT_FAILURE);
    }

    if ((pthread_join(threads[1], NULL)) != 0)
    {
        fprintf(stderr, "Could not join thread 2\n");
        exit(EXIT_FAILURE);
    }

    return 0;
}

(This was compiled on 64-bit Fedora; due to the 64-bit types used for pthread_t sourced from <bits/pthreadtypes.h>, the code will require minor changes to compile on 32-bit editions.)

The output I get is as follows:

[bean@fedora ~]$ ./thread_test 
[Process] getpid() = 28549
[Process] pthread_self() = 140050170017568
[Thread 2] getpid() = 28549
[Thread 2] pthread_self() = 140050161620736
[Thread 1] getpid() = 28549
[Thread 1] pthread_self() = 140050170013440
[bean@fedora ~]$ 

By using scheduler locking in gdb, I can keep the program and its threads alive so I can capture what top says, which, just showing processes, is:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28602 bean      20   0 15272 1112  820 R  0.4  0.0   0:00.63 top
 2036 bean      20   0  108m 1868 1412 S  0.0  0.0   0:00.11 bash
28547 bean      20   0  231m  16m 7676 S  0.0  0.4   0:01.56 gdb
28549 bean      20   0 22688  340  248 t  0.0  0.0   0:00.26 thread_test
28561 bean      20   0  107m 1712 1356 S  0.0  0.0   0:00.07 bash

And when showing threads, says:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28617 bean      20   0 15272 1116  820 R 47.2  0.0   0:00.08 top
 2036 bean      20   0  108m 1868 1412 S  0.0  0.0   0:00.11 bash
28547 bean      20   0  231m  16m 7676 S  0.0  0.4   0:01.56 gdb
28549 bean      20   0 22688  340  248 t  0.0  0.0   0:00.26 thread_test
28552 bean      20   0 22688  340  248 t  0.0  0.0   0:00.00 thread_test
28553 bean      20   0 22688  340  248 t  0.0  0.0   0:00.00 thread_test
28561 bean      20   0  107m 1860 1432 S  0.0  0.0   0:00.08 bash

It seems to be quite clear that programs, or perhaps the kernel, have a distinct way of defining threads in contrast to processes. Each thread has its own PID according to top - why?

These confusions all stem from the fact that the kernel developers originally held an irrational and wrong view that threads could be implemented almost entirely in userspace using kernel processes as the primitive, as long as the kernel offered a way to make them share memory and file descriptors. This lead to the notoriously bad LinuxThreads implementation of POSIX threads, which was rather a misnomer because it did not give anything remotely resembling POSIX thread semantics. Eventually LinuxThreads was replaced (by NPTL), but a lot of the confusing terminology and misunderstandings persist.

The first and most important thing to realize is that "PID" means different things in kernel space and user space. What the kernel calls PIDs are actually kernel-level thread ids (often called TIDs), not to be confused with pthread_t which is a separate identifier. Each thread on the system, whether in the same process or a different one, has a unique TID (or "PID" in the kernel's terminology).

What's considered a PID in the POSIX sense of "process", on the other hand, is called a "thread group ID" or "TGID" in the kernel. Each process consists of one or more threads (kernel processes) each with their own TID (kernel PID), but all sharing the same TGID, which is equal to the TID (kernel PID) of the initial thread in which main runs.

When top shows you threads, it's showing TIDs (kernel PIDs), not PIDs (kernel TGIDs), and this is why each thread has a separate one.

With the advent of NPTL, most system calls that take a PID argument or act on the calling process were changed to treat the PID as a TGID and act on the whole "thread group" (POSIX process).