Best linux questions in June 2011

Does epoll(), do its job in O(1)?

21 votes

Wikipedia says

unlike the older system calls, which operate at O(n), epoll operates in O(1) [2]).

http://en.wikipedia.org/wiki/Epoll

However, the source code at fs/eventpoll.c on Linux-2.6.38, seems it is implemented with an RB tree for searching, which has O(logN)

/*
 * Search the file inside the eventpoll tree. The RB tree operations
 * are protected by the "mtx" mutex, and ep_find() must be called with
 * "mtx" held.
 */
static struct epitem *ep_find(struct eventpoll *ep, struct file *file, int fd)
{

In fact, I couldn't see any man page saying the complexity of epoll() is O(1). Why is it known as O(1)?

This makes sens once you look for ep_find. I only spent a few minutes with it and I see ep_find is only called by epoll_ctl.

So indeed, when you add the descriptors (EPOLL_CTL_ADD) that costly operation is performed. BUT when doing the real work (epoll_wait) it isn't. You only add the descriptors in the beginning.

In conclusion, it's not enough to ask the complexity of epoll, since there is no epoll system call. You want the individual complexities of epoll_ctl, epoll_wait etc.

Other stuff

There are other reasons to avoid select and use epoll. When using select, you don't know how many descriptors need attention. So you must keep track of the biggest and loop to it.

rc = select(...);
/* check rc */
for (s = 0; s <= maxfd; s++) {
    if (FD_ISSET(s)) {
        /* ... */
    }
}

Now with epoll it's a lot cleaner:

nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
/* check nfds */
for (n = 0; n < nfds; ++n) {
    /* events[n].data.fd needs attention */
}

Recursion without recursive call?

14 votes

Found this on /prog/. I actually GDB'd it, and yes, it was truly a recursion. But how did it happen?

// This works on 32-bit x86 Linux with gcc as long as you don't enable optimization.

#include <stdio.h>
#include <stdlib.h>

static void factorial(int in, int *out)
{
  *(&in-1)-=5-5*(1/in);
  *out*=in--;
}

int main(int argc, char **argv) 
{
  int result=1;
  int number=0;

  if (argc!=2) 
    exit(1);

  number=atoi(argv[1]);
  if (number<1)
    exit(2);

  factorial(number, &result);
  printf("%d! = %d\n", number, result);
  return 0;
}


$ ./factorial 3
3! = 6

$ ./factorial 5
5! = 120

Sweet. ;)

This is extremely non-portable code that works only on x86. What it's doing is changing the return address on the stack so that if in>1, the function returns not to the instruction following the call instruction, but to the call instruction itself. A call instruction on x86 is five bytes (one opcode plus the 4-byte address of the call destination), so five needs to be subtracted from the return address.

This

*(&in-1)-=5-5*(1/in);

is just an obfuscated way of saying

if(in>1)
    *(&in-1)-=5;

And &in-1 is where the return address lives on the stack.

Who uses POSIX realtime signals and why?

11 votes

I am not being flip I really don't get it. I just read a whole bunch of material on them and I can't figure out the use case. I am not talking talking so much about the API for which the advantages over things like signal() are clear enough. Rather it seems RT signals are meant to be user space generated but to what end? The only use seems to be a primitive IPC but everything points to them being a lousy form of IPC (e.g. awkward, limited information, not particularly efficient, etc).

So where and how are they used?

First of all, note that Ben's answer is correct. As far as I can tell, the whole purpose of realtime signals in POSIX is as a realtime delivery mechanism for AIO, message queue notifications, timer expirations, and application-defined signals (both internal and inter-process).

With that said, signals in general are a really bad way to do things:

  • Signal handlers are asynchronous, and unless you ensure they do not interrupt an async-signal-unsafe function, they can only use async-signal-safe functions, which really cripples what they can do.
  • Signal handlers are global state. A library cannot use signals without a contract with the calling program regarding which signals it's allowed to use, whether it's allowed to make them syscall-interrupting, etc. And in general, global state is just A Bad Thing.
  • If you use sigwait (or Linux signalfd extension) rather than signal handlers to process signals, they're no better than other IPC/notification mechanisms, and still potentially worse.

Asynchronous IO is much better accomplished by ignoring the ill-designed POSIX AIO API and just creating a thread to perform normal blocking IO and call pthread_cond_signal or sem_post when the operation finishes. Or, if you can afford a little bit of performance cost, you can even forward the just-read data back to yourself over a pipe or socketpair, and have the main thread process asynchronously-read regular files with select or poll just like you would sockets/pipes/ttys.

Which C/C++ Compilers do enterprises use on Linux?

9 votes

I have been using the GCC Compiler for months, which is great, and works very well. But I wonder which C++ Compiler do big/medium enterprises use for high optimizations/performance in Linux (x86, PowerPC...).

It may seem a very stupid question, but I havent found the answer anywhere.

As far as I know, the best PowerPC Compiler is the XL, but x86 I dont know anything.

EDIT: Thanks a lot for all the answers. They were all very helpful. You have convinced me to use GCC ;) Regards!

In the places I have worked we have always used gcc, even for embedded applications/software.

Performance improvements are much more likely to come from your code rather than your compiler choice anyway!

struct member alignment - is it possible to assume no padding

9 votes

Imagine a struct made up of 32-bit, 16-bit, and 8-bit member values. Where the ordering of member values is such that each member is on it's natural boundary.

struct Foo
{
    uint32_t a;
    uint16_t b;
    uint8_t c;
    uint8_t d;
    uint32_t e;
};

Member alignment and padding rules are documented for Visual C++. sizeof(Foo) on VC++ the above struct is predictably "12".

Now, I'm pretty sure the rule is that no assumption should be made about padding and alignment, but in practice, do other compilers on other operating systems make similar guarantees?

If not, is there an equivalent of "#pragma pack(1)" on GCC?

On systems that actually offer those types, it is highly likely to work. On, say, a 36-bit system those types would not be available in the first place.

GCC provides an attribute
 

__attribute__ ((packed))

With similar effect.

Time terminology

9 votes

I'm a bit lost with time terminology.

I understand what epoch is and the difference between GMT and UTC.

However I'm a bit confused about the following notations:

  • calendar time
  • local time
  • wall time

How are these related to timezones and daylight savings?

P.S. Good link (thanks to @ZincX)

  • Calendar time is the time that has elapsed since the epoch (t - E).
  • Local time is calendar time corrected for the timezone and DST.
  • Wall time: I assume you mean wallclock time. This it the time elapsed since a process or job has started running.

RTFM time(7), localtime(3), time(1).

Why is a pthread mutex considered "slower" than a futex?

8 votes

Why are POSIX mutexes considered heavier or slower than futexes? Where is the overhead coming from in the pthread mutex type? I've heard that pthread mutexes are based on futexes, and when uncontested, do not make any calls into the kernel. It seems then that a pthread mutex is merely a "wrapper" around a futex.

Is the overhead simply in the function-wrapper call and the need for the mutex function to "setup" the futex (i.e., basically the setup of the stack for the pthread mutex function call)? Or are there some extra memory barrier steps taking place with the pthread mutex?

Because they stay in userspace as much as possible, which means they require fewer system calls, which is inherently faster because the context switch between user and kernel mode is expensive.

I assume you're talking about kernel threads when you talk about POSIX threads. It's entirely possible to have an entirely userspace implementation of POSIX threads which require no system calls but have other issues of their own.

My understanding is that a futex is halfway between a kernel POSIX thread and a userspace POSIX thread.

Is there a way to flag the use of non-reentrant C library calls?

8 votes

I'm working on a project that's heavily multi-threaded, and was wondering if there's a way to have the compiler flag the use of non-reentrant calls to the C library (e.g. strtok intsead of strtok_r)? If not, is there a list of calls that are non-reentrant so I can grep through my code base periodically?

A related question is if there's a way to flag 3d party library use of non-reentrant calls.

I'm assuming reentrancy implies thread-safety, but not necessarily the other way around. Is there a good reason to use non-reentrant calls in a threaded project?

For source, you could possibly insist that every source file contains the line:

#include <beware.h>

after the C headers, and then the beware.h header file contains:

#define strtok   unsafe_function_call_detected_strtok
#define getenv   unsafe_function_call_detected_getenv

or some other suitable set of names that are unlikely to be real functions. That will result in compilation and/or linker errors.

For libraries, it's a bit more difficult. You can look into using nm to extract all the unresolved names in each object file and ensure none of the unsafe ones are called.

This wouldn't be the compiler doing it but it would be easy enough to incorporate into the build scripts. See the following transcript:

$ cat qq.c
    #include <stdio.h>

    int main (int argc, char *argv[]) {
        printf ("Hello, world.\n");
        return 0;
    }

$ gcc -c -o qq.o qq.c

$ nm qq.o
00000000 b .bss
00000000 d .data
00000000 r .rdata
00000000 t .text
         U ___main
00000000 T _main
         U _puts

You can see the unresolved symbols in that output with a U marker (and gcc has very sneakily decided to use puts instead of printf since I gave it a constant string with no formatting commands).

Why won't LD_PRELOAD work with Python?

7 votes

Using function interposition for open() with Python doesn't seem to work after the first few calls. I suspect Python is doing some kind of initialization, or something is temporarily bypassing my function.

Here the open call is clearly hooked:

$ cat a
hi
$ LD_PRELOAD=./libinterpose_python.so cat a
sandbox_init()
open()
hi

Here it happens once during Python initialization:

$ LD_PRELOAD=./libinterpose_python.so python
sandbox_init()
Python 2.7.2 (default, Jun 12 2011, 20:20:34) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
open()
>>> 
sandbox_fini()

Here it doesn't happen at all, and there's no error to indicate the file handle had write privileges removed:

$ LD_PRELOAD=./libinterpose_python.so python3 -c 'b = open("a", "w"); b.write("hi\n"); b.flush()'
sandbox_init()
sandbox_fini()

The code is here. Build with make -f Makefile.interpose_python.

Solution

It turns out there is an open64() function:

$ objdump -T /lib32/libc.so.6  | grep '\bopen'
00064f10 g    DF .text  000000fc  GLIBC_2.4   open_wmemstream
000cc010 g    DF .text  0000007b  GLIBC_2.0   openlog
000bf6d0  w   DF .text  000000b6  GLIBC_2.1   open64
00094460  w   DF .text  00000055  GLIBC_2.0   opendir
0005f9b0 g    DF .text  000000d9  GLIBC_2.0   open_memstream
000bf650  w   DF .text  0000007a  GLIBC_2.0   open
000bf980  w   DF .text  00000081  GLIBC_2.4   openat
000bfb90  w   DF .text  00000081  GLIBC_2.4   openat64

The open64() function is a part of the large file extensions, and is equivalent to calling open() with the O_LARGEFILE flag.

Running the example code with the open64 section uncommented gives:

$ LD_PRELOAD=./libinterpose_python.so python3 -c 'b = open("a", "w"); b.write("hi\n"); b.flush()'
sandbox_init()
open64()
open64()
open64()
Traceback (most recent call last):
  File "<string>", line 1, in <module>
open64()
open64()
open64()
open64()
open64()
open64()
open64()
IOError: [Errno 9] Bad file descriptor
sandbox_fini()

Which clearly shows all of Python's open calls, and several propagated errors due to the write flag being stripped from the calls.

There are open() and open64() functions, you might need to redefine both.

C source in .W file

7 votes

I found a C source in these files with .w extensions. It seems like a mix of TeX code and C Programming Language. This is an example of these sources.

How can I compile?

PS: Excuse me for the silly question but I didn't found any documentation

Use Knuth's CWEB, a literate-programming tool. You can download it from here.

How can I programmatically get the amount of memory currently available from C/C++ code?

6 votes

In my middleware software layer I am receiving a lot of crashes due with the message,

page allocation failure. order:10, mode:0xd1

As I understand the crash can occur due to number of reasons, running out of dynamic memory for further allocations or Memory fragmentation.

The backtrace which follows the message is not relevant beause it comes through a third party driver module, Most likely the problem is with that driver but Unfortunately, I can neither get any debug information from that nor the source code. I would like to profile my source code using some programatic function calls so as to rule out the possibility of the two scenarios I mentioned above.

I cannot use valgrind because ARM is not yet fully supported by Valgrind.

[Update:] Adding the stacktrace, as @caf's answer suggests there can be some valuable information in it.

Application: page allocation failure. order:10, mode:0xd1
Backtrace:                                                  
[<c00297d0>] (dump_backtrace+0x0/0x114) from [<c02812b8>] (dump_stack+0x18/0x1c)
 r7:0000000a r6:000000d1 r5:00000000 r4:00000000                                
[<c02812a0>] (dump_stack+0x0/0x1c) from [<c00716e4>] (__alloc_pages_nodemask+0x49c/0x4fc)
[<c0071248>] (__alloc_pages_nodemask+0x0/0x4fc) from [<c007175c>] (__get_free_pages+0x18/0x44)
[<c0071744>] (__get_free_pages+0x0/0x44) from [<bf021790>] (tsif_request_rx_buffer+0x74/0xf4 [tsif_data])
[<bf02171c>] (tsif_request_rx_buffer+0x0/0xf4 [tsif_data]) from [<bf021bd8>] (tsif_data_ioctl+0x17c/0x9d4 [tsif_data])
 r7:be9e8604 r6:c0045319 r5:c3793400 r4:00000007
[<bf021a5c>] (tsif_data_ioctl+0x0/0x9d4 [tsif_data]) from [<c00a1c30>] (vfs_ioctl+0x78/0x94)
[<c00a1bb8>] (vfs_ioctl+0x0/0x94) from [<c00a22e0>] (do_vfs_ioctl+0x594/0x5f0)
 r7:c2067e80 r6:00000021 r5:c2067e80 r4:00000021
[<c00a1d4c>] (do_vfs_ioctl+0x0/0x5f0) from [<c00a237c>] (sys_ioctl+0x40/0x64)
[<c00a233c>] (sys_ioctl+0x0/0x64) from [<c0025ec0>] (ret_fast_syscall+0x0/0x28)
 r7:00000036 r6:00144220 r5:00139030 r4:008a47cc
Mem-info:
DMA per-cpu:
CPU    0: hi:   18, btch:   3 usd:   0
active_anon:4120 inactive_anon:134 isolated_anon:0
 active_file:79 inactive_file:3729 isolated_file:0
 unevictable:0 dirty:0 writeback:0 unstable:0 buffer:0
 free:4137 slab_reclaimable:198 slab_unreclaimable:894
 mapped:1707 shmem:64 pagetables:75 bounce:0
DMA free:16548kB min:1104kB low:1380kB high:1656kB active_anon:16480kB inactive_anon:536kB active_file:316kB inactive_file:14916kB unevictable:0kB isolated(o
lowmem_reserve[]: 0 0 0
DMA: 215*4kB 131*8kB 73*16kB 49*32kB 30*64kB 8*128kB 3*256kB 2*512kB 3*1024kB 2*2048kB 0*4096kB 0*8192kB 0*16384kB = 16548kB
3872 total pagecache pages
19200 pages of RAM
4209 free pages
4738 reserved pages
960 slab pages
1751 pages shared
0 pages swap cached

So the question is How can I programmatically get the amount of memory currently available from C/C++ code, the platform is Linux.

The message you have shown indicates a failure to allocate memory for a kernel allocation, not a userspace allocation. It is a request for a 4MB (this is what order = 10 means) block of contiguous physical memory. This is a very large kmalloc() request, and it is not surprising that it fails (likely due to memory fragmentation rather than free memory).

You can find out how much free memory is available in /proc/meminfo, however a more detailed summary is available from the kernel log immediately after the backtrace - starting with the line "Mem-info".


Your backtrace shows that there is 16548kB of memory available, but the largest block is 2048kB (order = 9). So memory fragmentation is indeed your problem.

From reading the source to the tsif driver you appear to be using, it seems that the driver requests a kernel allocation with a size entirely controlled by userspace, invoked by the TSIF_REQ_RX_BUF ioctl() (this is a really bad design, especially given that it doesn't even try to report failures to userspace!). My suggestion is to reduce the size of the buffer you are requesting with this ioctl.

Is there an equivalent to Python's expandvars in C?

6 votes

I have a string stored in a file that is read into a string. I wish to replace variables defined in *nix shell format with the corresponding environment values.

For example, an environment variable of $DEPLOY=/home/user will turn "deploypath=$DEPLOY/dir1" into "deploypath=/home/user/dir1"

Is there a simple library to do this?

i.e.

#include "supersimplelib.h"
char *newstr = expandvars(oldstr);

(or similar)

I understand I could use a regular expression lib and then call getenv() but I was wondering if there was another simpler way?

It will only be compiled under Linux.

wordexp appears to do what you need. Here's a modified version of an example program from this manpage (which also gives a lot of excellent detail on wordexp).

#include <stdio.h>
#include <wordexp.h>
int main(int argc, char **argv) {
        wordexp_t p;
        char **w;
        int i;
        wordexp("This is my path: $PATH", &p, 0);
        w = p.we_wordv;
        for (i=0; i<p.we_wordc; i++)
                printf("%s ", w[i]);
        printf("\n");
        wordfree(&p);
        return 0;
}

This produces the following output on my machine (Ubuntu Linux 10.04, used gcc to compile).

$ ./a.out 
This is my path: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games 

I found the manpage above to be most useful, but there's also more information from the GNU C Library Reference Manual.