Consider a typical storage organization of a program:
All dynamically allocated data are stored in the heap.
These are the data created by malloc in C or new in C++.
You can imagine the heap as a vector of bytes (characters) and end_of_heap
a pointer to the first available byte in the heap:
char heap[heap_size]; int end_of_heap = 0;A trivial implementation of
malloc in C is:
void* malloc ( int size ) {
void* loc = (void*) &heap[end_of_heap];
end_of_heap += size;
return loc;
};
If you always create dynamic structures but never delete them,
you will eventually run out of heap space. In that case, malloc
will request the operating system for more heap (this code is not shown). This is very expensive,
because it may require to move the stack data to higher memory locations.
Alternatively, you can recycle the dynamically allocated data that you don't use.
This is done using free in C or delete in C++.
To do so, you need to link all deallocated data in the heap into a list:
typedef struct Header { struct Header *next; int size; } Header;
Header* free_list;
Initially, the free list contains only one element that covers the entire heap
(ie, it's heap_size bytes long):
void initialize_heap () {
free_list = (Header*) heap;
free_list->next = 0;
free_list->size = heap_size;
};
(Pardon me for the terrible C code, but this can only be done by coercing bytes to pointers.)
Note that each cell in the heap has a Header component (ie, a next
pointer and a size) followed by size-sizeof(Header) bytes (so that is size bytes long).
malloc first searches the free list to find a cell large enough to fit the given number of bytes.
If it finds one, it gets a chunk out of the cell leaving the rest untouched:
void* malloc ( int size ) {
Header* prev = free_list;
for(Header* r=free_list; r!=0; prev=r, r=r->next)
if (r->size > size+sizeof(Header))
{ Header* new_r = (Header*) (((char*) r)+size);
new_r->next = r->next;
new_r->size = r->size;
if (prev==free_list)
free_list = new_r;
else prev->next = new_r;
return (void*) r;
};
void* loc = (void*) &heap[end_of_heap];
end_of_heap += size;
return loc;
};
free simply puts the recycled cell at the beginning of the free list:
void free ( void* x ) {
if ("size of *x" <= sizeof(Header))
return;
((Header*) x)->next = free_list;
((Header*) x)->size = "size of *x";
free_list = (Header*) x;
};
We can see that there is a lot of overhead in malloc since the
free list may be very long. The most important problem though is
fragmentation of the heap into tiny cells, so that even though the
total free space in the free list is plenty, it is useless for large
object allocation. The reason for fragmentation is that we
unrestrictedly choose a cell from the free list to do allocation. It
makes more sense to try to find a free cell that provides the requested
number of bytes exactly. This can be done by maintaining a vector of
free lists, so that the nth element of the vector is a free list that
links all the free cells of size n.
The above malloc/free way of allocating dynamic cells is called a manual memory allocation since the programmer is responsible for allocating and deleting objects explicitly. This is to be contrasted to the automatic memory management, in which the system manages the dynamic allocation automatically. Manual memory management is the source of the worst and hardest to find bugs. It's also the source of most of the mysterious program crashes. It takes the fun out of programming. It causes horrible memory problems due to ``overflow'', ``fence past errors'', ``memory corruption'', ``step-on-others-toe'' (hurting other variable's memory locations) or ``memory leaks''. The memory problems are extremely hard to debug and are very time consuming to fix and troubleshoot. Memory problems bring down the productivity of programmers. Memory related bugs are very tough to crack, and even experienced programmers take several days or weeks to debug memory related problems. Memory bugs may be hidden inside the code for several months and can cause unexpected program crashes. It is estimated that the memory bugs due to usage of char* and pointers in C/C++ is costing $2 billions every year in time lost due to debugging and downtime of programs. Now of course all modern languages (Java, SML, Haskell, etc) use automatic garbage collection. Despite that, C and C++ are still popular and programmers spend most of their time trying to do manual memory allocation (and then, after they remove most core dumps from their program, they port it to a different architecture and their program crashes!).
Why manual memory management is so hard to do correctly? Basically,
you need to have a global view of how dynamic instances of a type are
created and passed around. This destroys the good software engineering
principle that programs should be developed in small independent
components. The problem is to keep track of pointer assignments. If
more than one objects point to a dynamically allocated object, then
the latter object should be deleted only if all objects that point to
it do not need it anymore. So basically, you need to keep track of
how many pointers are pointing to each object. This is called reference counting and used to be popular for OO languages like C++.
We do not call this method automatic garbage collection because the
programmer again is responsible of putting counters to every object to
count references. This method is easy to implement for languages like
C++ where you can redefine the assignment operator dest=source,
when both dest and source are pointers to data.
Instead of using C* for a pointer to an object C,
we use Ref<C>, where the template Ref provides
reference counting to C:
template< class T >
class Ref {
private:
int count;
T* pointer;
void MayDelete () { if (count==0) delete pointer; };
void Copy ( const Ref &sp ) {
++sp.count;
count = sp.count;
pointer = sp.pointer;
};
public:
Ref ( T* ptr = 0 ) : count(1), pointer(ptr) {};
Ref ( const Ref &sp ) { Copy(sp); };
~Ref () { MayDelete(); };
T* operator-> () { return pointer; };
Ref& operator= ( const Ref &sp ) {
if (this != &sp) {
count--;
MayDelete();
Copy(sp);
};
return *this;
};
};
When an instance of Ref<C> is created, its counter is set to 1.
When we delete the object, we just decrement the counter because there
may be other objects pointing to it. But when the counter becomes
zero, we delete it. The only way to move pointers around, is through
an assignment from a C* object to a C* object. In that
case, the object pointed by the source pointer will have one pointer
more, while the object pointed by the destination pointer will have
one pointer less. Reference counting avoids some misuses of the heap
but it comes with a high cost: every assignment takes many cycles to
be completed and some of the work may be unnecessary since you may
pass a pointer around causing many unnecessary counter
increments/decrements. In addition, we cannot get rid of cyclic
objects (eg. when A points to B and B points to A) using reference
counting. All objects that participate in a cyclic chain of references
will always have their counters greater than zero, so they will never
be deleted, even if they are garbage.