Description
When allocating and deallocating randomly sized memory chunks with 4 or more threads using openmp's parallel for construct, the program seems to start leaking considerable amounts of memory in the second half of the test-program's runtime. Thus it increases its consumed memory from 1050 MB to 1500 MB or more without actually making use of the extra memory.
As valgrind shows no issues, I must assume that what appears to be a memory leak actually is an emphasized effect of memory fragmentation.
Interestingly, the effect does not show yet if 2 threads make 10000 allocations each, but it shows strongly if 4 threads make 5000 allocations each. Also, if the maximum size of allocated chunks is reduced to 256kb (from 1mb), the effect gets weaker.
Can heavy concurrency emphasize fragmentation that much ? Or is this more likely to be a bug in the heap ?
Test Program Description
The demo program is build to obtain a total of 256 MB of randomly sized memory chunks from the heap, doing 5000 allocations. If the memory limit is hit, the chunks allocated first will be deallocated until the memory consumption falls below the limit. Once 5000 allocations where performed, all memory is released and the loop ends. All this work is done for each thread generated by openmp.
This memory allocation scheme allows us to expect a memory consumption of ~260 MB per thread (including some bookkeeping data).
Demo Program
As this is really something you might want to test, you can download the sample program with a simple makefile from dropbox.
When running the program as is, you should have at least 1400 MB of RAM available. Feel free to adjust the constants in the code to suit your needs.
For completeness, the actual code follows:
#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <vector>
#include <deque>
#include <omp.h>
#include <math.h>
typedef unsigned long long uint64_t;
void runParallelAllocTest()
{
// constants
const int NUM_ALLOCATIONS = 5000; // alloc's per thread
const int NUM_THREADS = 4; // how many threads?
const int NUM_ITERS = NUM_THREADS;// how many overall repetions
const bool USE_NEW = true; // use new or malloc? , seems to make no difference (as it should)
const bool DEBUG_ALLOCS = false; // debug output
// pre store allocation sizes
const int NUM_PRE_ALLOCS = 20000;
const uint64_t MEM_LIMIT = (1024 * 1024) * 256; // x MB per process
const size_t MAX_CHUNK_SIZE = 1024 * 1024 * 1;
srand(1);
std::vector<size_t> allocations;
allocations.resize(NUM_PRE_ALLOCS);
for (int i = 0; i < NUM_PRE_ALLOCS; i++) {
allocations[i] = rand() % MAX_CHUNK_SIZE; // use up to x MB chunks
}
#pragma omp parallel num_threads(NUM_THREADS)
#pragma omp for
for (int i = 0; i < NUM_ITERS; ++i) {
uint64_t long totalAllocBytes = 0;
uint64_t currAllocBytes = 0;
std::deque< std::pair<char*, uint64_t> > pointers;
const int myId = omp_get_thread_num();
for (int j = 0; j < NUM_ALLOCATIONS; ++j) {
// new allocation
const size_t allocSize = allocations[(myId * 100 + j) % NUM_PRE_ALLOCS ];
char* pnt = NULL;
if (USE_NEW) {
pnt = new char[allocSize];
} else {
pnt = (char*) malloc(allocSize);
}
pointers.push_back(std::make_pair(pnt, allocSize));
totalAllocBytes += allocSize;
currAllocBytes += allocSize;
// fill with values to add "delay"
for (int fill = 0; fill < (int) allocSize; ++fill) {
pnt[fill] = (char)(j % 255);
}
if (DEBUG_ALLOCS) {
std::cout << "Id " << myId << " New alloc " << pointers.size() << ", bytes:" << allocSize << " at " << (uint64_t) pnt << "\n";
}
// free all or just a bit
if (((j % 5) == 0) || (j == (NUM_ALLOCATIONS - 1))) {
int frees = 0;
// keep this much allocated
// last check, free all
uint64_t memLimit = MEM_LIMIT;
if (j == NUM_ALLOCATIONS - 1) {
std::cout << "Id " << myId << " about to release all memory: " << (currAllocBytes / (double)(1024 * 1024)) << " MB" << std::endl;
memLimit = 0;
}
//MEM_LIMIT = 0; // DEBUG
while (pointers.size() > 0 && (currAllocBytes > memLimit)) {
// free one of the first entries to allow previously obtained resources to 'live' longer
currAllocBytes -= pointers.front().second;
char* pnt = pointers.front().first;
// free memory
if (USE_NEW) {
delete[] pnt;
} else {
free(pnt);
}
// update array
pointers.pop_front();
if (DEBUG_ALLOCS) {
std::cout << "Id " << myId << " Free'd " << pointers.size() << " at " << (uint64_t) pnt << "\n";
}
frees++;
}
if (DEBUG_ALLOCS) {
std::cout << "Frees " << frees << ", " << currAllocBytes << "/" << MEM_LIMIT << ", " << totalAllocBytes << "\n";
}
}
} // for each allocation
if (currAllocBytes != 0) {
std::cerr << "Not all free'd!\n";
}
std::cout << "Id " << myId << " done, total alloc'ed " << ((double) totalAllocBytes / (double)(1024 * 1024)) << "MB \n";
} // for each iteration
exit(1);
}
int main(int argc, char** argv)
{
runParallelAllocTest();
return 0;
}
The Test-System
From what I see so far, the hardware matters a lot. The test might need adjustments if run on a faster machine.
Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz
Ubuntu 10.04 LTS 64 bit
gcc 4.3, 4.4, 4.6
3988.62 Bogomips
Testing
Once you have executed the makefile, you should get a file named ompmemtest. To query the memory usage over time, I used the following commands:
./ompmemtest &
top -b | grep ompmemtest
Which yields the quite impressive fragmentation or leaking behaviour. The expected memory consumption with 4 threads is 1090 MB, which became 1500 MB over time:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11626 byron 20 0 204m 99m 1000 R 27 2.5 0:00.81 ompmemtest
11626 byron 20 0 992m 832m 1004 R 195 21.0 0:06.69 ompmemtest
11626 byron 20 0 1118m 1.0g 1004 R 189 26.1 0:12.40 ompmemtest
11626 byron 20 0 1218m 1.0g 1004 R 190 27.1 0:18.13 ompmemtest
11626 byron 20 0 1282m 1.1g 1004 R 195 29.6 0:24.06 ompmemtest
11626 byron 20 0 1471m 1.3g 1004 R 195 33.5 0:29.96 ompmemtest
11626 byron 20 0 1469m 1.3g 1004 R 194 33.5 0:35.85 ompmemtest
11626 byron 20 0 1469m 1.3g 1004 R 195 33.6 0:41.75 ompmemtest
11626 byron 20 0 1636m 1.5g 1004 R 194 37.8 0:47.62 ompmemtest
11626 byron 20 0 1660m 1.5g 1004 R 195 38.0 0:53.54 ompmemtest
11626 byron 20 0 1669m 1.5g 1004 R 195 38.2 0:59.45 ompmemtest
11626 byron 20 0 1664m 1.5g 1004 R 194 38.1 1:05.32 ompmemtest
11626 byron 20 0 1724m 1.5g 1004 R 195 40.0 1:11.21 ompmemtest
11626 byron 20 0 1724m 1.6g 1140 S 193 40.1 1:17.07 ompmemtest
Please Note: I could reproduce this issue when compiling with gcc 4.3, 4.4 and 4.6(trunk).
When linking the test program with google's tcmalloc library, the executable doesn't only run ~10% faster, but shows greatly reduced or insignificant memory fragmentation as well:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13441 byron 20 0 379m 334m 1220 R 187 8.4 0:02.63 ompmemtestgoogle
13441 byron 20 0 1085m 1.0g 1220 R 194 26.2 0:08.52 ompmemtestgoogle
13441 byron 20 0 1111m 1.0g 1220 R 195 26.9 0:14.42 ompmemtestgoogle
13441 byron 20 0 1131m 1.1g 1220 R 195 27.4 0:20.30 ompmemtestgoogle
13441 byron 20 0 1137m 1.1g 1220 R 195 27.6 0:26.19 ompmemtestgoogle
13441 byron 20 0 1137m 1.1g 1220 R 195 27.6 0:32.05 ompmemtestgoogle
13441 byron 20 0 1149m 1.1g 1220 R 191 27.9 0:37.81 ompmemtestgoogle
13441 byron 20 0 1149m 1.1g 1220 R 194 27.9 0:43.66 ompmemtestgoogle
13441 byron 20 0 1161m 1.1g 1220 R 188 28.2 0:49.32 ompmemtestgoogle
13441 byron 20 0 1161m 1.1g 1220 R 194 28.2 0:55.15 ompmemtestgoogle
13441 byron 20 0 1161m 1.1g 1220 R 191 28.2 1:00.90 ompmemtestgoogle
13441 byron 20 0 1161m 1.1g 1220 R 191 28.2 1:06.64 ompmemtestgoogle
13441 byron 20 0 1161m 1.1g 1356 R 192 28.2 1:12.42 ompmemtestgoogle
From the data I have, the answer appears to be:
Multithreaded access to the heap can emphasize fragmentation if the employed heap library does not deal well with concurrent access and if the processor fails to execute the threads truly concurrently.
The tcmalloc library shows no significant memory fragmentation running the same program that previously caused ~400MB to be lost in fragmentation.
But why does that happen ?
The best idea I have to offer here is some sort of locking artifact within the heap.
The test program will allocate randomly sized blocks of memory, freeing up blocks allocated early in the program to stay within its memory limit. When one thread is in the process of releasing old memory which is in a heap block on the 'left', it might actually be halted as another thread is scheduled to run, leaving a (soft) lock on that heap block. The newly scheduled thread wants to allocate memory, but may not even read that heap block on the 'left' side to check for free memory as it is currently being changed. Hence it might end up using a new heap block unnecessarily from the 'right'.
This process could look like a heap-block-shifting, where the the first blocks (on the left) remain only sparsely used and fragmented, forcing new blocks to be used on the right.
Lets restate that this fragmentation issue only occurs for me if I use 4 or more threads on a dual core system which can only handle two threads more or less concurrently. When only two threads are used, the (soft) locks on the heap will be held short enough not to block the other thread who wants to allocate memory.
Also, as a disclaimer, I didn't check the actual code of the glibc heap implementation, nor am I anything more than novice in the field of memory allocators - all I wrote is just how it appears to me which makes it pure speculation.
Another interesting read might be the tcmalloc documentation, which states common problems with heaps and multi-threaded access, some of which may have played their role in the test program too.
Its worth noting that it will never return memory to the system (see Caveats paragraph in tcmalloc documentation)