Handling Configuration Files
A grad student who is part of the University of Wisconsin – Madison’s excellent Hacker Within group asked a question last week that deserves a longer answer than I gave at the time. The question was, “How should I pass configuration parameters into my program?” (Actually, her original question was, “How do I write a configuration file parser in C++?”, but that presupposes an answer to the one I’m going to discuss here.) Scientists often need to do this—to run a simulation for different reactant concentrations, or experiment with the effects of different clustering thresholds on phylogenetic tree reconstruction—so let’s have a look at some of the options.
Method #1: change the constant or variable definitions in the program, and recompile it for each run. For example, if you’re using C++, you can define constants in a header file like this:
//
// params.h : control parameters for simulation
//
#define T_QUENCH 300.0
#define T_EXCITE 450.0
or even better, use proper constants like this:
//
// params.h : control parameters for simulation (declarations)
//
extern const float T_QUENCH;
extern const float T_EXCITE;

//
// params.cpp : control parameters for simulation (values)
//
#include "params.h"
const float T_QUENCH = 300.0;
const float T_EXCITE = 450.0;
Each time you want to change the values, you edit params.cpp, run Make to recompile, and then run your program. If you’re really clever, you’ll put the command to re-run the program in the Makefile, and arrange your dependencies so that it recompiles the program if necessary.
Software engineering purists will recoil in horror at this, but if you only need to change the program’s parameters a few times, it may actually be the simplest choice, since you don’t have to write (or test, or debug) any kind of configuration code. The main drawback is that anyone else who wants to use your program has to set up a complete build environment.
Method #2: pass parameters as command-line arguments. If you only have a handful of scalar parameters (i.e., parameters that consist of a single value, like the two in the example above, as opposed to parameters made up of lists of values, like wave forms), then command-line parameters may be the easiest way to go. In C++, these are passed as strings to your main function via argc and argv. If your program is run like this:
$ ./anneal 300.0 450.0
then the following will work:
#include <cstdlib>   // for atof

int main(
    int argc,       // number of command-line arguments
    char ** argv    // array of pointers to command-line arguments
){
    const char * program_name = argv[0];
    const float t_quench = atof(argv[1]);
    const float t_excite = atof(argv[2]);
    run_simulation(t_quench, t_excite);
    return 0;
}
Of course, purists are spluttering again right now. First, atof returns 0 both when it gets the string “0” and when it fails, so the program can’t tell a mistyped argument from a legitimate zero; you should use strtod or sscanf instead and check the result. Second, this program doesn’t check that parameters have been passed in the right order: someone could pass 450.0 300.0 by mistake. (It’s unlikely in this case, since the quenching temperature is always less than the excitation temperature, but in other cases, where the parameters don’t have a natural order, transposition mistakes are very easy to make.)
The right way to do this is to use something like the getopt library so that the command line is:
$ ./anneal -q 300.0 -e 450.0
or even better:
$ ./anneal --quench 300.0 --excite 450.0
This provides a little bit of documentation (if you use history to look at recently-run commands, you can easily read off the parameters you’ve been using). It also means that you can run the program like this:
$ ./anneal --excite 450.0 --quench 300.0
with no ill effects: parameters are picked off by name, not position, which is a lot safer when there are more than two or three.
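In Python, the standard library's argparse module does the job that getopt does for C and C++. Here is a minimal sketch, assuming the option names from the command lines above; the parse_args helper and the help strings are made up for illustration:

```python
import argparse

def parse_args(argv=None):
    """Parse the annealing parameters by name rather than by position."""
    parser = argparse.ArgumentParser(prog='anneal')
    parser.add_argument('-q', '--quench', type=float, required=True,
                        help='quenching temperature')
    parser.add_argument('-e', '--excite', type=float, required=True,
                        help='excitation temperature')
    # argv=None means "use the real command line" (sys.argv[1:]).
    return parser.parse_args(argv)

# Order does not matter because each value is matched to its option name.
first = parse_args(['--quench', '300.0', '--excite', '450.0'])
second = parse_args(['--excite', '450.0', '--quench', '300.0'])
print(first.quench, first.excite)
print(second.quench, second.excite)
```

Because each option is declared with type=float, a value that is not a number is reported as an error and the program exits, instead of silently turning into zero the way it does with atof.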
OK, so why wouldn’t you do this? For one, you might have so many parameters that this becomes cumbersome. (As a rule of thumb based solely on personal taste, if there are more than half a dozen, you should be thinking about doing it some other way.) Second, if some of those parameters are multi-valued, this approach starts to break down as well. To come back to the example alluded to above, if one of the parameters to your program is a sampled wave form that’s used to filter signals, you don’t really want to type:
$ ./throttle --waveform 0.0000105 0.0000209 0.0000410 ... 0.0152720
Method #3: put parameters in a plain text file. Almost everyone gets here eventually. Put your parameters in a file like this:
quench 300.0
anneal 450.0
and read it with code like this (written in Python for the sake of brevity and readability):
import sys

# The parameter file name is the program's sole argument.
reader = open(sys.argv[1], 'r')
for line in reader:
    name, value = line.split()
    if name == 'quench':
        t_quench = float(value)
    elif name == 'anneal':
        t_anneal = float(value)
    else:
        sys.stderr.write('Bad parameter name "%s"\n' % name)
        sys.exit(1)
run_simulation(t_quench, t_anneal)
It works, but we can do better: much better. First, let’s allow blank lines and comments beginning with ‘#’:
for line in reader:
    line = line.split('#')[0].strip()
    if not line:
        continue
    name, value = line.split()
    ...as before...
The first three lines of the loop body take everything that was before the first ‘#’ on the line and strip off any leading and trailing whitespace. If the result is the empty string, the line was blank, or consisted solely of a comment. Either way, the program continues on to the next line without trying to get a parameter name and value.
This version still doesn’t handle multi-valued parameters, but it’s pretty easy to change the line.split() call and what follows it to do so. We’ll leave that as an exercise for the reader, though, and look at something else instead. Suppose that some values are floats, but others are integers or strings (such as an output file name). Here’s a much cleaner way to take care of parsing:
import sys

Handlers = {
    'border' : int,
    'excite' : float,
    'output' : str,
    'quench' : float
}

reader = open(sys.argv[1], 'r')
params = {}
for line in reader:
    line = line.split('#')[0].strip()
    if not line:
        continue
    name, value = line.split()
    if name not in Handlers:
        sys.stderr.write('Bad parameter name "%s"\n' % name)
        sys.exit(1)
    if name in params:
        sys.stderr.write('Duplicate parameter name "%s"\n' % name)
        sys.exit(1)
    conversion_func = Handlers[name]
    params[name] = conversion_func(value)
run_simulation(params)
Handlers is a dictionary that maps parameter names to functions that know how to convert string representations of those parameters to, well, whatever type they’re supposed to be. Each time a name/value pair is read from the file, this program checks that there is a conversion function (which doubles as a check that the parameter’s name is one we recognize), then checks that we don’t already have a value for that parameter (i.e., that the file doesn’t mistakenly include duplicates). It then applies the conversion function to the value’s string representation, and stores the result. All the parameter values are then passed into the simulation in one tidy dictionary.
This approach is nice because all the logic for parsing parameters can be re-used in other programs, including future versions of this program. For example, if we want to add α and β for controlling crystallization speed, the only thing we have to change is the “table” of parameter names and conversion functions stored in Handlers:
Handlers = {
    'alpha'  : float,
    'beta'   : float,
    'border' : int,
    'excite' : float,
    'output' : str,
    'quench' : float
}
This also gives us a natural place to put in error checking—we just write our own conversion functions, like this:
def convert_alpha(text):
    raw = float(text)
    if (raw < 0.0) or (raw > 1.0):
        sys.stderr.write('alpha out of range: "%f"\n' % raw)
        sys.exit(1)
    return raw
Yes, we can be smarter so that we don’t have to write a separate function for each parameter, but let’s not go down that path. In fact, let’s not go down this path, because there are already lots of configuration file syntaxes and parsers out there. Writing one of our own may be fun, but it’s busy-work: if we really need configuration files, we should grab a library and use that. Which brings us to…
Method #4: put parameters in a structured text file that can be parsed by an existing library. Whatever your needs, the odds are good that someone else has met them before, along with others that you haven’t yet (but probably will). The odds are also pretty good that code in your favorite language’s standard library will be less buggy right off the bat than anything you could write yourself.
But what syntax to use? One option is Windows INI files; another is XML, and of course there’s the new hipster on the block, JSON. Of these, INI is the only one designed first and foremost to be written and read by human beings; XML and JSON both defer more to machines’ needs, which makes typing them in more painful.
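To give a feel for the INI option, here is a minimal sketch using the configparser module from Python's standard library. The section name, the border value, and the output file name are invented for illustration; the parameter names are borrowed from the Handlers table above.

```python
import configparser

# An INI file is a series of [sections], each holding name = value pairs.
INI_TEXT = """
# control parameters for simulation
[simulation]
quench = 300.0
excite = 450.0
border = 4
output = results.dat
"""

parser = configparser.ConfigParser()
parser.read_string(INI_TEXT)   # parser.read(filename) would load a real file

# The typed getters do the string-to-value conversion for us.
section = parser['simulation']
params = {
    'quench': section.getfloat('quench'),
    'excite': section.getfloat('excite'),
    'border': section.getint('border'),
    'output': section.get('output'),
}
print(params)
```

Comments, blank lines, and malformed lines are all handled by the library, which is most of what the hand-written parser in Method #3 was doing.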
Method #5: put parameters in a dynamically-loaded code module. This is the nerdiest option, and to understand it, we have to step back for a moment and look at how computers run programs. If I type:
import blarg
in a Python program, the Python interpreter:
- finds a file called blarg.py;
- reads the text it contains into a string in memory;
- parses that string according to Python’s syntax rules to create byte code instructions;
- executes those instructions (which are typically value and function definitions);
- stores the results in an object that’s a lot like a dictionary; and
- assigns that object to a variable called blarg.
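The steps above can be approximated by hand. This is only a rough sketch: the real import machinery also searches the directories in sys.path, remembers loaded modules in sys.modules, and caches byte code on disk. The file contents and the temporary directory are made up so that the example is self-contained.

```python
import os
import tempfile
import types

# Create a throwaway blarg.py so the sketch can run anywhere.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'blarg.py')
with open(path, 'w') as writer:
    writer.write('answer = 6 * 7\n')

# Read the text the file contains into a string in memory.
with open(path, 'r') as reader:
    source = reader.read()

# Parse that string to create byte code instructions.
code = compile(source, path, 'exec')

# Execute the instructions, storing the results in a module object,
# whose namespace is an ordinary dictionary.
module = types.ModuleType('blarg')
exec(code, module.__dict__)

# Assign that object to a variable called blarg.
blarg = module
print(blarg.answer)   # prints 42
```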
OK, so suppose I create a Python file called config.py that contains:
t_quench = 300.0
t_excite = 450.0
If my program contains:
import config
then config.t_quench has the value 300.0, and config.t_excite has the value 450.0. We didn’t have to parse anything: Python did it for us. And hey, we can now use expressions and conditionals in our “configuration file” for free:
t_quench = 300.0
t_excite = 4.0 * t_quench / 5.0
if t_excite < 500.0:
    alpha = 0.0
    beta = 1.0
else:
    alpha = 0.2
    beta = 0.8
But wait: isn’t this just method #1 all over again? Don’t we have to edit config.py each time we want to change parameter values? Well, no: if you use the importlib library’s import_module function, you can specify what you want to import dynamically, i.e., provide a name like “config_low_alpha_binding” as a command-line parameter, and have Python load parameters from config_low_alpha_binding.py.
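Here is a minimal sketch of that idea. The load_config helper is hypothetical, and the configuration module is written to a temporary directory only so that the example runs on its own; in a real program the file would already exist and its name would come from the command line.

```python
import importlib
import os
import sys
import tempfile

# Write a throwaway configuration module so the sketch is self-contained.
config_dir = tempfile.mkdtemp()
with open(os.path.join(config_dir, 'config_low_alpha_binding.py'), 'w') as writer:
    writer.write('t_quench = 300.0\nt_excite = 450.0\n')

# Only needed because the file was created while the program was running.
importlib.invalidate_caches()

def load_config(module_name, directory):
    """Import the named configuration module from the given directory."""
    if directory not in sys.path:
        sys.path.insert(0, directory)
    return importlib.import_module(module_name)

# In a real program the name would come from the command line, e.g. sys.argv[1].
config = load_config('config_low_alpha_binding', config_dir)
print(config.t_quench, config.t_excite)
```

Note that import_module takes a module name, not a file name, so the ".py" is left off.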
Nerds like me tend to like this option a lot: simple things remain simple, but we have the full power of a programming language when we need it (i.e., we don’t have to invent our own clumsy syntax for expressions and conditionals in configuration files). The downsides are:
- It’s hard for non-nerds to understand what’s going on—dynamic code loading is a pretty advanced concept;
- it’s a lot harder to do in compiled languages like Fortran, C++, Java, and C# than it is in dynamic languages like Python, Perl, and Ruby (for historical reasons, MATLAB lies somewhere in between); and
- it’s not portable between different languages: C++, Java, Fortran, MATLAB, and Python all agree on the syntax of XML, but only one of them can read Python.
In practice, the first two drawbacks matter most: if speed is important, you’ve probably written your code in a compiled language, which means that loading other bits of code on the fly and pulling out values is hard both technically and intellectually. Some people try to get around this by writing the configuration and control parts of their program in a dynamic language, while leaving the computational core in a compiled language, but that’s just trading one problem for another: building, debugging, and maintaining multi-language programs is not something to be undertaken lightly.
Method #6: build a configuration GUI. I’ve included this option for the sake of completeness, and because it’s usually essential if you want your program to be widely used. Building a desktop or web-based GUI takes a lot of time, but if done well, it can make your program much more accessible, particularly if the same GUI allows people to visualize the program’s output. Be warned, though: GUI construction is a very easy way to procrastinate…
No matter how you get parameters into your program, there’s one rule you should never break: always (always!) include those values in your program’s output, so that when you come back and re-examine old output files a year later, you know exactly what parameters were used. This practice is a small but crucial part of tracking the provenance of your data, a topic we’ll return to in a future post.
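One simple way to follow that rule is to write the parameters as comment lines at the top of each output file. This is a sketch under assumptions: the write_results helper, the ‘#’ header convention, and the sample values are all made up for illustration.

```python
import io

def write_results(writer, params, results):
    """Write the parameters as header comments, then one result per line."""
    for name in sorted(params):
        writer.write('# %s = %r\n' % (name, params[name]))
    for value in results:
        writer.write('%s\n' % value)

params = {'quench': 300.0, 'excite': 450.0, 'output': 'results.dat'}
buffer = io.StringIO()   # stands in for an open output file
write_results(buffer, params, [0.25, 0.5])
print(buffer.getvalue())
```

Sorting the names keeps the header in the same order from run to run, which makes old output files easier to compare.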
