Sunday, June 12, 2016

Formaline: The Provenance of Computational Astrophysics Simulations and their Results (Part I)

Key to all scientific research is reproducibility. In areas of science that rely heavily on computer simulations, the open-source approach is essential: it allows people to check the details of a simulation code and to re-run simulations to reproduce results.

Open source is essential, but it's not enough. Anybody involved in simulations knows that the code used for running the actual simulations often differs in non-trivial ways from the released source code. Even when we use version control software like git, too often we run simulations with a version of the code that is either not tagged or has uncommitted changes. And this does not even take into account that complex simulation codes typically have a myriad of parameters that are set individually for each simulation; parameter files typically don't make it into publications or git repositories. A similar issue exists with the initial conditions (i.e., the input data) from which a simulation is started. Rarely are these available for others to reproduce results.

Provenance is a term used in computational science (and in some other research areas) to describe the full history of a (computational) scientific result. The provenance of a simulation thus includes not only the released open-source code, but also the full information on the exact source code that was used, perhaps even the way it was compiled and where it was run, the parameters that were set for the simulation, and all initial conditions and other ingredients that went into it, in whatever form (along with their respective provenances).

We have to tackle provenance and reproducibility step by step. In this two-part series of posts, my focus is on preserving the actual source code that was used to run a simulation and produce specific results, or to create some other (astro)physics result (for example, a neutrino-matter opacity table used in core-collapse supernova simulations). This can be accomplished by what I will generally call Formaline. Formaline (more commonly spelled formalin) is an aqueous solution of formaldehyde that is used as a disinfectant or to preserve biological specimens. Formaline for code/simulations does the analogous thing: it preserves source code and simulation/physics metadata. Formaline is an idea by Erik Schnetter, Research Technologies Group Lead at the Perimeter Institute for Theoretical Physics. Erik first implemented Formaline for the Cactus Computational Toolkit in the early 2000s.

Using Formaline is super easy! (Python example)

Formaline comes in two forms, and in both it preserves source code: (1) It can include the source code in the binary of the simulation code; upon execution, the code writes the source (usually as a tar ball) into the output directory. (2) It can include source code and parameter files in HDF5 data files, which are now very broadly used both for simulation data and for physics inputs to simulations (such as a neutrino interaction table).

I'll talk about Formaline (1) in my next post. Here, I'll focus on how to implement Formaline (2) and include (and later extract) a source tar ball in/from an HDF5 data file. It turns out that this is extremely easy if you have the h5py and numpy Python packages installed.

Let's consider the case in which you have some standalone code that produces an HDF5 table called table.h5. The table is generated by some Fortran 90 code in and below the directory src, and the build takes a top-level parameter file called parameters, a Makefile, a make.inc for machine-specific build flags, and a README file.

Here is a script that puts all of the source code into the HDF5 file. What follows is a minimal sketch of the idea (the full version ships with NuLib, linked below); it assumes the directory layout described above and a dataset name of my choosing:
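import tarfile

import h5py
import numpy as np

# Files and directories to preserve -- the layout described above.
inputs = ["src", "parameters", "Makefile", "make.inc", "README"]

# Create a gzipped tar ball of the source, input, and parameter files.
with tarfile.open("code.tar.gz", "w:gz") as tar:
    for item in inputs:
        tar.add(item)

# Read the tar ball back in as raw bytes; np.void wraps them as an
# opaque binary blob that h5py can store as a scalar dataset.
with open("code.tar.gz", "rb") as f:
    tarball = np.void(f.read())

# Append the tar ball to the existing HDF5 table. The dataset name
# "code.tar.gz" is just my choice here -- pick whatever you like.
with h5py.File("table.h5", "a") as h5file:
    h5file.create_dataset("code.tar.gz", data=tarball)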

So, after running this script, your HDF5 table table.h5 contains a dataset that holds the tar.gz ball of our source code, input, and parameter files. You can view the table of contents of any HDF5 file with h5ls [filename].

Now, let's say we find table.h5 five years later and want to know how it was created. We can go and extract the source/input/parameter tar.gz ball! Here is a script that does just that (again a minimal sketch along the same lines). It first creates an output directory called saved_code to make sure nothing is overwritten:
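import io
import os
import tarfile

import h5py

# Create the output directory so that nothing in the current
# directory gets overwritten.
outdir = "saved_code"
os.makedirs(outdir, exist_ok=True)

# Read the stored tar ball back out of the HDF5 file. The dataset
# name has to match the one used when the tar ball was stored.
with h5py.File("table.h5", "r") as h5file:
    tarball = h5file["code.tar.gz"][()].tobytes()

# Unpack the tar ball into the output directory.
with tarfile.open(fileobj=io.BytesIO(tarball), mode="r:gz") as tar:
    tar.extractall(path=outdir)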

So that's a super easy way to make results reproducible -- just store the source code and all input/parameter files with the results! I think it's a no-brainer and we all should be doing it. HDF5 is a standard, well documented format that will be readable decades from now -- so your work is safe!

I wrote this implementation of Formaline for the NuLib neutrino interaction library, and a variant of it is bundled with NuLib, available on GitHub at https://github.com/evanoconnor/NuLib.

In my next post, I'll talk about Formaline (1), which includes a source tar ball inside the executable. This is the original incarnation of Formaline that Erik Schnetter came up with.

Friday, April 15, 2016

A Beginner's Super Simple Tutorial for GNU Make

(GNU) Make has been around for a very long time, has been superseded (as some may argue) by CMake, but it is still an incredibly useful tool for... for what? For developing code on the command line. Yes, that's right. We still use editors and terminals to develop code. Integrated "windowy" development environments (IDEs) are great, but often overkill. And it's really hard to run them when you are developing remotely on a supercomputer.

But even a hard-core terminal shell & editor developer needs some help for productivity. Make is a key productivity tool for developing code.

Back in the old days (say just after punch cards were no more; way before my time), most people would write code in one big source file and then would just invoke the compiler to build the executable. So your entire code would be in a single file. Not very convenient for editing. Certainly not convenient for working collaboratively. When codes got bigger, it just became impractical. Compile times became huge, since every time a single line changed in one routine, the entire code had to be recompiled.

So people started breaking up their code into pieces, each of them living in its own source file. That's natural, since code is usually broken down into some kind of logical parts (e.g., subroutines and functions in procedural programming, classes/objects in object-oriented programming). So it makes sense to put logical units of the code into their own source files. But how do you get your executable built from an array of source files? And how do you avoid having to recompile everything all the time?

This is where Make comes in!

Let's assume we are working with Fortran 90+ here and compile things using gfortran (if you like C better, just replace everything Fortran-specific with C; or with whatever language you like). Let's say you have three source files:

main_program.F90
subroutine_one.F90
subroutine_two.F90

Note that I am using upper case F90 instead of the more common f90. The upper case suffix tells the compiler to pre-process the source files before compiling them. In this way, we can use preprocessor directives; see https://en.wikipedia.org/wiki/Directive_(programming) if you are interested. We won't use these directives here, but perhaps I'll write a post about them in the future.

So that you can follow along easily, let's actually write these routines:

main_program.F90:

program myprogram

  implicit none

  call subroutine_one
  call subroutine_two

end program myprogram

subroutine_one.F90:

subroutine subroutine_one
  implicit none
  write(*,*) "Hi, I am subroutine one"
end subroutine subroutine_one

subroutine_two.F90: 

subroutine subroutine_two
  implicit none
  write(*,*) "Hi, I am subroutine two"
end subroutine subroutine_two

Naively, you could build your program this way:
gfortran -g -O3 -o myprogram main_program.F90 subroutine_one.F90 subroutine_two.F90
This produces an executable called myprogram, but every single time you compile, everything gets recompiled. By the way, the options I am using in the compile line are "-g" (add debug symbols) and "-O3" (optimize aggressively; not really needed here, but I do it out of habit -- I want fast code!).

Recompiling is of course no problem if the routines don't do anything of substance as in our example. But let's for a moment assume that each of these routines does something incredibly complicated that takes minutes or longer to compile. You don't want to have to recompile everything if you made a change in only one of the source files. The trick to avoid this is to first build object files (file suffix ".o"), then to link them together and only remake those whose source files changed.
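By hand, this two-step process would look like the following: first compile each source file into an object file with the -c flag, then link the object files into the executable.

gfortran -g -O3 -c main_program.F90
gfortran -g -O3 -c subroutine_one.F90
gfortran -g -O3 -c subroutine_two.F90
gfortran -g -O3 -o myprogram main_program.o subroutine_one.o subroutine_two.o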

Let Make take care of figuring out what changed. You tell Make what to do by writing a makefile in which you specify what you want to make and what it depends on. Call the file Makefile; then Make will automatically find it when you type make on the command line. Here is the makefile for our example:
FC=gfortran
FCFLAGS=-g -O3

myprogram: main_program.o subroutine_one.o subroutine_two.o
     $(FC) $(FCFLAGS) -o myprogram main_program.o subroutine_one.o subroutine_two.o

main_program.o: main_program.F90
     $(FC) $(FCFLAGS) -c main_program.F90

subroutine_one.o: subroutine_one.F90
     $(FC) $(FCFLAGS) -c subroutine_one.F90

subroutine_two.o: subroutine_two.F90
     $(FC) $(FCFLAGS) -c subroutine_two.F90

clean:
     rm -f *.o

There is a very clear structure to this. The first two lines define variables that are used later -- the compiler and the compiler flags (you could call these variables anything you want). Then come the rules: each starts with the target on the left, followed by a colon. After the colon, you list the things that the target depends on (the dependencies). This tells Make what needs to be built and up to date before it works on the current target. So in our case, myprogram depends on a bunch of object files, and the rules for making these sub-targets are specified subsequently. One important thing to remember: the line that tells Make what to do for a given target (once the dependencies are met and up to date) has to begin with a tabulator character (i.e., you have to hit your "tab" key; spaces won't work -- that's something historic, I guess). But clearly this is not a show stopper!

Note the "clean" target -- I added this so that you can easily clean things up.

So, for example, let's say you want to compile your code freshly from scratch. The first thing you do is say (on the command line) make clean. This will get rid of all previously compiled object files. Next you just type make, and Make will compile and link everything for you! Now you go and work on one of the source files. You save it and then you want to recompile. Just type make. This time only the source file that you changed is recompiled and then linked together with the other object files to make the executable. Voilà!
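For our example, that whole cycle should look something like this (Make echoes each command before running it):

$ make clean
rm -f *.o
$ make
gfortran -g -O3 -c main_program.F90
gfortran -g -O3 -c subroutine_one.F90
gfortran -g -O3 -c subroutine_two.F90
gfortran -g -O3 -o myprogram main_program.o subroutine_one.o subroutine_two.o
$ # ... edit and save subroutine_one.F90 ...
$ make
gfortran -g -O3 -c subroutine_one.F90
gfortran -g -O3 -o myprogram main_program.o subroutine_one.o subroutine_two.o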

I hope you see the benefits of using Make now! Make is of course much more powerful than this simple example can show. You can read more about advanced uses of Make in the GNU Make manual at https://www.gnu.org/software/make/manual/ and at many other places that your favorite search engine will find for you.
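As a small taste of that power, here is a sketch of the same makefile written with a pattern rule and so-called automatic variables, so you don't have to write one rule per source file ($@ stands for the target, $< for the first dependency; and remember the tab characters in the command lines):

FC=gfortran
FCFLAGS=-g -O3
OBJECTS=main_program.o subroutine_one.o subroutine_two.o

myprogram: $(OBJECTS)
     $(FC) $(FCFLAGS) -o $@ $(OBJECTS)

%.o: %.F90
     $(FC) $(FCFLAGS) -c $<

clean:
     rm -f *.o myprogram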

Before I close, there is one more thing I want to mention: Make is also great for compiling LaTeX documents, in particular if you are using cross-referencing and bibliography software like BibTeX.
Here is an example makefile for compiling a LaTeX document called paper.tex to paper.pdf with BibTeX. I assume that you have your BibTeX bibliography entries in a file called paper.bib:
paper.pdf: paper.tex paper.bib
    pdflatex paper.tex
    bibtex paper
    pdflatex paper.tex
    pdflatex paper.tex
This should work -- the repeated pdflatex runs are needed so that the bibliography and all cross-references get fully resolved. Enjoy!