I would like to discuss the issue of creating stable code in this article. Because this is a huge topic, enough to fill an entire book on its own, I will stick to the core ideas and keep the article brief.
There are two main paradigms in the pursuit of creating stable software. One is based on building a complex, unified system that is designed to have no memory leaks and no bugs, to run forever, and to handle high capacity. The other is based on a model where processes are small and highly focused on doing only one task, and doing it extremely fast and effectively; memory leaks do not really matter because the processes do not run forever. They are started and die as they are needed, hence the reasoning that memory leaks and some types of bugs are not critical. Microsoft SQL Server is an example of the former, and the Apache web server on Unix is an example of the latter.
I am going to discuss some guidelines on writing stable software using the first-mentioned paradigm, as it is in wide use and is the most difficult to implement correctly.
The first issue I want to discuss is that of exceptions and errors, but a definition of those terms is first in order. An error refers to the origin of a problem found in the execution of a process, one which forces unexpected program flow. An exception refers to the encapsulation of an error into a manageable exception object that can be thrown by the code that incurred the error, and caught at a higher level which can deal with the problem. As an example, an error could be a division by zero operation; the system (or the software itself) will then throw an exception for this error so that it can be handled properly. The higher-level exception handler might then report the error to the user, or do whatever is applicable.
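To make the distinction concrete, here is a minimal sketch of that division by zero example (the Divide() function and its messages are my own, hypothetical illustration):

#include <stdexcept>
#include <iostream>

double Divide(double pNumerator, double pDenominator)
{
    // The error: a division by zero is detected here...
    if (pDenominator == 0.0)
        // ...and encapsulated into an exception object that is thrown
        throw std::runtime_error("Divide(): division by zero");
    return pNumerator / pDenominator;
}

int main()
{
    try {
        std::cout << Divide(10.0, 0.0) << std::endl;
    }
    catch (const std::runtime_error& E) {
        // The higher level deals with the problem, e.g. reports it
        std::cout << "Error reported to user: " << E.what() << std::endl;
    }
    return 0;
}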
Sometimes it is useful to keep a list of objects in memory. An example is a list of connection objects for TCP/IP connections to a server. When data is received, a function can enumerate this list and find the correct connection object by matching the socket address, for example. It will then do whatever it needs to do with this connection object. In large systems, it is better to use a methodology called reference counting instead of the classical create/destroy methods. It is the same principle on which COM objects work in the Windows environment. For a discussion of a custom-written object that is a descendant of TObject (Delphi/Kylix) and is based on reference counting, see Rudolph van Graan's web site. This object, called TLockableObject, exposes the same functionality and behavior as the TObject class, but adds reference counting to it.
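I will not reproduce TLockableObject here; for illustration, a minimal C++ sketch of the reference counting idea (my own simplified version, without the locking that the real class presumably provides) could look like this:

class TRefCounted {
private:
    int mRefCount;
protected:
    // Protected destructor: destruction must go through Release()
    virtual ~TRefCounted() {}
public:
    TRefCounted() : mRefCount(1) {}   // the creator holds the first reference
    void AddRef() { ++mRefCount; }    // call before handing the object out
    void Release()                    // call instead of delete
    {
        if (--mRefCount == 0)
            delete this;
    }
};

Every piece of code that stores a pointer to such an object calls AddRef(), and calls Release() when it is done with it; the object frees itself when the last reference is released. Note that this sketch is not thread-safe.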
Now, assume you have lots of TConnection objects, which are ultimately derived from TLockableObject. Assume that in the code handling new connections, you create a new connection object and add it to the list. The reference count will be 1. Let's say that later, in the code that handles data received from the port, this list is enumerated for the applicable connection object, and this connection object is then submitted to a kernel which will process the request asynchronously. Obviously we need to increment the reference count when we submit the connection object to the kernel: we must make sure that the connection object will not be freed on an unexpected disconnect event, with the kernel then accessing a freed object (causing a memory read access violation). The reference count is now 2. Can you see that if the system is large and complex, it is necessary not to make any assumptions about the validity of the objects in the list? Each time the list is enumerated in search of a connection object, a check must be added to verify that the connection object is not nil (which would be the case when the reference count goes to 0 and the correct methods had been used to set the object's value to nil).
for (int i = 0; i < mConnectionList->count(); i++) {
    TConnection* vObj = mConnectionList->GetObject(i);
    if (vObj != NULL) {
        // then do something...
    } else {
        DoElse();
    }
}
Now, in much of the code I have reviewed and read, I saw the implementation of DoElse() looking like this:
for (int i = 0; i < mConnectionList->count(); i++) {
    TConnection* vObj = mConnectionList->GetObject(i);
    if (vObj != NULL) {
        // then do something...
    } else {
        continue;
    }
}
You may ask, "But what is wrong with that? If the connection object retrieved is NULL, we must just make sure we do not work with it." Well, the problem with that piece of code lies deeper than a casual inspection would reveal. At a company I used to work for, a system had been designed that managed lots of these types of connections and other objects, all being passed around using reference counting and stored in lists. It had very, very weird and inconsistent behaviors: sometimes a connection just would not show up in the connection list, or sometimes it would not change state correctly, etc. It also had a subtle memory leak that, under certain conditions, would be as bad as leaking 200MB in a week's time. For a system designed to run forever, this is obviously totally unacceptable! The problem was found to be a combination of several different things, of which the above-mentioned issue was one. As Rudolph explains on his web site, it is very important to evaluate the else condition. It makes you think more deeply about what your application does and should do in case of unexpected behavior. In this case the easiest way out of the problem was to use a continue statement. In 99% of the cases this should work fine, and if there were no bugs or programming errors in the code, it should work 100% of the time, because the else would never get executed. I call this method silent error suppression: handling error conditions in such a way that the system keeps on working in a transparent way; the error has been suppressed.
As any good developer will know, we humans cannot write bug-free code. In my humble opinion this is impossible, as the number of external influences is too great to cater for all conditions. An extreme example: you have crafted the perfect system. You have a for loop that does something. You assume (quite correctly) that the index will increment by one through each iteration. Now there is a power brownout in the system which runs your application, and some bits in memory are destroyed - accidentally also one bit in the memory location that holds the index variable. Suddenly your variable is greater than the bound condition, and you access memory at an invalid array index. Boom! The program crashes. Okay - I agree, this is an extreme situation, and one should not try to cater for these types of situations, simply because it would cause absolute code bloat and would not really be effective in any case. But the idea sticks - one cannot write 100% robust, stable and bug-free code.
The point I am trying to make is simple - because we cannot write bug-free code, one cannot assume that "it should work 100% of the time". It will work 99% of the time. It is quite possible that somewhere you made a mistake and freed the connection object twice. This loop would suppress the error, thus helping you not to know about the bug you introduced elsewhere. See how this methodology prevents you from writing stable, robust code?
I have a different attitude towards these kinds of situations. I would rewrite that loop as:
for (int i = 0; i < mConnectionList->count(); i++) {
    TConnection* vObj = mConnectionList->GetObject(i);
    if (vObj != NULL) {
        // then do something...
    } else {
        // std::runtime_error requires #include <stdexcept>
        throw std::runtime_error("Internal error #123 encountered: "
                                 "vObj = NULL in method XX::YY()");
    }
}
See how this would cause the program to alert us to the error? A higher-level layer could then inform the user about this condition and terminate the program gracefully. The bug can then be submitted to the developer and corrected. This is in contrast to the belief that this is a recoverable situation. In this case, it is not. If that object is nil, then although your program will seem to keep on running fine, it will be in an inconsistent state. If you scream at all inconsistent (should-never-happen) situations, you will be able to see much faster when a bug is present, and also have a good idea of where it affects the system. Note that I do not mean normal exceptional conditions - for example, when parsing a file where you expect each record to be on the following line, but encounter an empty line. This would most probably not indicate a failure, and merely skipping all blank lines adds a bit of adaptability to your code, silently suppressing a normal error condition.
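To illustrate that last point, here is a small sketch of such a tolerant parser (ParseRecords() and the one-record-per-line format are my own, hypothetical example):

#include <fstream>
#include <string>

void ParseRecords(const std::string& pFileName)
{
    std::ifstream vFile(pFileName.c_str());
    std::string vLine;
    while (std::getline(vFile, vLine)) {
        // A blank line is a normal, expected condition here,
        // so suppressing it silently is the right thing to do
        if (vLine.empty())
            continue;
        // parse the record on this line...
    }
}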
Another common mistake that novice (and sometimes experienced) developers make is illustrated in the following piece of code:
unsigned long faculty(int n)
{
    if (n <= 1)
        return 1;
    else
        return n * faculty(n - 1);
}
...
unsigned long n_fact;
int y;
y = g(x);
n_fact = faculty(y);
In this piece of code, we have a function faculty() that calculates the factorial of n, i.e. n!. We also have some code that assigns a value to y based on the function g(x), which can do anything as long as it returns an integer. As you can see, this code will only work correctly for valid values of n, where valid is defined as an integer n >= 0. So we need to add some parameter validation checking. Let's change the main code to:
...
unsigned long n_fact;
int y;
y = g(x);
if (y >= 0)
    n_fact = faculty(y);
else
    // std::runtime_error requires #include <stdexcept>
    throw std::runtime_error("The factorial of a negative number cannot be taken");
This code should now work perfectly, and negative values of y will not break it. But there is a huge problem with this method, and this problem is the root cause of many unstable and buggy systems. The assumption made here is that you are the only user of the function faculty() and that you know how it works inside. Therefore you know which values you may and may not pass to that function. But as normally happens, those small functions we write become very useful and get reused very frequently. Later on, it is not only you (the author) who uses it, but lots of other people as well. The problem is that they should not and do not have to know the details of how the faculty() function works - the only thing they need to know is that it will calculate n! and what happens on wrong input.
Can you see that everywhere this function is used, it would be necessary to add a check or two on the validity of the data before passing it to the faculty() function? This is wrong. The validation checks should be an integral part of the smallest building blocks of a system - starting with those small functions. If I change the code like this:
#include <stdexcept>

// A minimal emaths exception class, derived from std::runtime_error
class emaths : public std::runtime_error {
public:
    explicit emaths(const std::string& pMsg) : std::runtime_error(pMsg) {}
};

unsigned long faculty(int n)
{
    if (n < 0)
        throw emaths("n cannot be negative");
    else if (n <= 1)
        return 1;
    else
        return n * faculty(n - 1);
}
...
unsigned long n_fact;
int y;
y = g(x);
try {
    n_fact = faculty(y);
}
catch (emaths& E) {
    // Do something about the error
}
Immediately the burden of doing parameter validation is removed from the code which uses the function, and moved into the function itself, where it can be tested, debugged and reused. Now all a user needs to know about the function is that it returns the factorial of n for valid n, and throws an emaths exception on error. The faculty() function is now self-contained. Of course, a better solution would have been to declare n to be of type unsigned instead...
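For completeness, that version would look like this (note that a caller passing a negative int would have it silently converted to a large unsigned value, so the burden shifts to the caller's compiler warnings rather than disappearing entirely):

unsigned long faculty(unsigned n)
{
    // n cannot be negative by construction, so no validation is needed
    if (n <= 1)
        return 1;
    else
        return n * faculty(n - 1);
}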
Although I wanted to stay away from OS-specific or language-specific issues, I want to discuss a brilliant technique one can use in C++ to ensure the proper release of memory allocated for objects. But first, an Object Pascal example of how to do proper object allocation and destruction:
procedure DoSomething(const pRegKey: String);
var
  vRegistry: TRegistry;
begin
  vRegistry := TRegistry.Create(pRegKey);
  if (vRegistry = nil) then
    raise Exception.Create('Exception in DoSomething() - vRegistry = nil');
  try
    // Do something that could cause an exception to be thrown
  finally
    vRegistry.Free;
  end;
end;
In the above function, one can be sure that vRegistry will always be freed - even if an exception occurs. This is made possible by the try-finally construct of Object Pascal. C++ does not have this construct. A first attempt might look like this:
void DoSomething(std::string pRegKey)
{
    TRegistry* vRegistry;
    vRegistry = new TRegistry(pRegKey);
    // new will throw an exception on failure
    try {
        // Do something that could cause an exception to be thrown
    }
    catch (...) {
        delete vRegistry;
        throw;
    }
    delete vRegistry;
}
If the code throws an exception in the try-catch block, the exception will be caught, the vRegistry object freed, and the exception re-thrown. If no exception occurs, the vRegistry object will still be freed. This method is verbose and error-prone. There is a very elegant solution to this problem, specific to C++ (or any language with deterministic destructors and operator overloading). It is called Resource acquisition is initialization (RAII), and is described by Bjarne Stroustrup - the creator of C++. The aforementioned example can be rewritten as:
class TReg {
private:
    TRegistry* mReg;
public:
    TReg(const std::string& pRegKey) : mReg(NULL)
        { mReg = new TRegistry(pRegKey); }
    ~TReg() { delete mReg; }
    operator TRegistry*() { return mReg; }
    TRegistry* operator ->() { return mReg; }
};
void DoSomething(std::string pRegKey)
{
    TReg vRegistry(pRegKey);
    try {
        // Do something that could cause an exception to be thrown
    }
    catch (...) {
        // Handle exception
    }
}
Can you see that in this code we create an instance of TReg, then use it and forget about it? The reason we are allowed to do this is that vRegistry is now a local variable that will be destroyed when the function returns, and when an object is destroyed, its destructor will always be called. To step through the code, start with the line TReg vRegistry(pRegKey);. On that line, a new object of type TReg is instantiated, with the parameter pRegKey being passed to the constructor. In the constructor, a new instance of the actual TRegistry class is instantiated. The two overloaded operators allow one to use vRegistry as if it were an instance of TRegistry. Hence, if TRegistry had a method called ReadKey(), I could use the following syntax in the function DoSomething(): vRegistry->ReadKey();. Finally, whenever the function returns (no matter whether it is due to normal flow or an exception), the vRegistry object will be destroyed because it goes out of scope. When that happens, the destructor will be called automatically and the instance of TRegistry will be freed. See how this method brilliantly solves the problem of acquiring and releasing resources? Here is a general template that you may use in your code:
namespace RAiI {
    template<class T> class TResource {
    private:
        T* mData;
        int mCount;
    public:
        TResource() : mData(NULL), mCount(0) { mData = new T; }
        TResource(int n) : mData(NULL), mCount(n)
            { if (n == 0) mData = new T; else mData = new T[n]; }
        ~TResource() { if (mCount == 0) delete mData; else delete[] mData; }
        operator T*() { return mData; }
        T* operator ->() { return mData; }
    };
}
...
// Use it as:
RAiI::TResource<TResourceToProtect> vResource;
// or...
RAiI::TResource<TResourceToProtect> vResource(vSizeOfArray);
// and access the member functions and data members of
// TResourceToProtect with the -> operator
vResource->SomeMemberFunctionOfTResourceToProtect();
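As a final illustration, the earlier DoSomething() could be written with this template as follows (a sketch: it assumes TRegistry has a default constructor, since the template above does not forward constructor arguments, and ReadKey() is the hypothetical method from before):

void DoSomething()
{
    RAiI::TResource<TRegistry> vRegistry;   // resource acquired here
    vRegistry->ReadKey();                   // used as if it were a TRegistry*
}                                           // freed automatically on return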