next up previous contents
Next: 13.2 Automatic Garbage Collection Up: 13 Garbage Collection Previous: 13 Garbage Collection   Contents

13.1 Storage Allocation

Consider a typical storage organization of a program:

stack.gif

All dynamically allocated data are stored in the heap. These are the data created by malloc in C or new in C++. You can imagine the heap as a vector of bytes (characters) and end_of_heap a pointer to the first available byte in the heap:

char heap[heap_size];
int end_of_heap = 0;
A trivial implementation of malloc in C is:
void* malloc ( int size ) {
  void* loc = (void*) &heap[end_of_heap];
  end_of_heap += size;
  return loc;
};
If you always create dynamic structures but never delete them, you will eventually run out of heap space. In that case, malloc will request the operating system for more heap (this code is not shown). This is very expensive, because it may require to move the stack data to higher memory locations. Alternatively, you can recycle the dynamically allocated data that you don't use. This is done using free in C or delete in C++. To do so, you need to link all deallocated data in the heap into a list:
typedef struct Header { struct Header *next; int size; } Header;
Header* free_list;
Initially, the free list contains only one element that covers the entire heap (ie, it's heap_size bytes long):
void initialize_heap () {
  free_list = (Header*) heap;
  free_list->next = 0;
  free_list->size = heap_size;
};
(This is terrible C code, but this can only be done by coercing bytes to pointers.) Note that each cell in the heap has a Header component (ie, a next pointer and a size) followed by size-sizeof(Header) bytes (so that is size bytes long). malloc first searches the free list to find a cell large enough to fit the given number of bytes. If it finds one, it gets a chunk out of the cell leaving the rest untouched:
void* malloc ( int size ) {
  Header* prev = free_list;
  for(Header* r=free_list; r!=0; prev=r, r=r->next)
     if (r->size > size+sizeof(Header))
     {  Header* new_r = (Header*) (((char*) r)+size);
        new_r->next = r->next;
        new_r->size = r->size;
        if (prev==free_list)
           free_list = new_r;
        else prev->next = new_r;
        return (void*) r;
     };
  void* loc = (void*) &heap[end_of_heap];
  end_of_heap += size;
  return loc;
};
free simply puts the recycled cell at the beginning of the free list:
void free ( void* x ) {
  if ("size of *x" <= sizeof(Header))
     return;
  ((Header*) x)->next = free_list;
  ((Header*) x)->size = "size of *x";
  free_list = (Header*) x;
};
We can see that there is a lot of overhead in malloc since the free list may be very long. The most important problem though is fragmentation of the heap into tiny cells, so that even though the total free space in the free list is plenty, it is useless for large object allocation. The reason for fragmentation is that we unrestrictedly choose a cell from the free list to do allocation. It makes more sense to try to find a free cell that provides the requested number of bytes exactly. This can be done by maintaining a vector of free lists, so that the nth element of the vector is a free list that links all the free cells of size n.

The above malloc/free way of allocating dynamic cells is called a manual memory allocation since the programmer is responsible for allocating and deleting objects explicitly. This is to be contrasted to the automatic memory management, in which the system manages the dynamic allocation automatically. Manual memory management is the source of the worst and hardest to find bugs. It's also the source of most of the mysterious program crashes. It takes the fun out of programming. It causes horrible memory problems due to ``overflow", ``fence past errors", ``memory corruption", ``step-on-others-toe" (hurting other variable's memory locations) or ``memory leaks". The memory problems are extremely hard to debug and are very time consuming to fix and troubleshoot. Memory problems bring down the productivity of programmers. Memory related bugs are very tough to crack, and even experienced programmers take several days or weeks to debug memory related problems. Memory bugs may be hidden inside the code for several months and can cause unexpected program crashes. It is estimated that the memory bugs due to usage of char* and pointers in C/C++ is costing $2 billions every year in time lost due to debugging and downtime of programs. Now of course all modern languages (Java, SML, Haskell, etc) use automatic garbage collection. Despite that, C and C++ are still popular and programmers spend most of their time trying to do manual memory allocation (and then, after they remove most core dumps from their program, they port it to a different architecture and their program crashes!).

Why manual memory management is so hard to do correctly? Basically, you need to have a global view of how dynamic instances of a type are created and passed around. This destroys the good software engineering principle that programs should be developed in small independent components. The problem is to keep track of pointer assignments. If more than one objects point to a dynamically allocated object, then the latter object should be deleted only if all objects that point to it do not need it anymore. So basically, you need to keep track of how many pointers are pointing to each object. This is called reference counting and used to be popular for OO languages like C++. We do not call this method automatic garbage collection because the programmer again is responsible of putting counters to every object to count references. This method is easy to implement for languages like C++ where you can redefine the assignment operator dest=source, when both dest and source are pointers to data. Instead of using C* for a pointer to an object C, we use Ref<C>, where the template Ref provides reference counting to C:

template< class T >
class Ref {
private:
   int count;
   T* pointer;
   void MayDelete () { if (count==0) delete pointer; };
   void Copy ( const Ref &sp ) {
         ++sp.count;
         count = sp.count;
         pointer = sp.pointer;
   };
public:
   Ref ( T* ptr = 0 ) : count(1), pointer(ptr) {};
   Ref ( const Ref &sp ) { Copy(sp); };
   ~Ref () { MayDelete(); };
   T* operator-> () { return pointer; };
   Ref& operator= ( const Ref &sp ) {
      if (this != &sp) {
         count--;
         MayDelete();
         Copy(sp);
      };
      return *this;
   };
};

When an instance of Ref<C> is created, its counter is set to 1. When we delete the object, we just decrement the counter because there may be other objects pointing to it. But when the counter becomes zero, we delete it. The only way to move pointers around, is through an assignment from a C* object to a C* object. In that case, the object pointed by the source pointer will have one pointer more, while the object pointed by the destination pointer will have one pointer less. Reference counting avoids some misuses of the heap but it comes with a high cost: every assignment takes many cycles to be completed and some of the work may be unnecessary since you may pass a pointer around causing many unnecessary counter increments/decrements. In addition, we cannot get rid of cyclic objects (eg. when A points to B and B points to A) using reference counting. All objects that participate in a cyclic chain of references will always have their counters greater than zero, so they will never be deleted, even if they are garbage.


next up previous contents
Next: 13.2 Automatic Garbage Collection Up: 13 Garbage Collection Previous: 13 Garbage Collection   Contents
2015-01-20