Friday, December 26, 2008

Parsing DirectX .X files

This has been by far the most demanding episode to write in these series of tutorials, but finally it's there. See http://code.google.com/p/fs-gamedev/wiki/Tutorial10 for the text and source code.

This tutorial shows how to parse .X files, which are used for 3d models and scenes. Note that DirectX offers convenient functions to manipulate these files, meaning that the practical usefulness of the code in this tutorial is limited, unless you don't have access to DirectX.

This tutorial demonstrates the following concepts:

* Discriminated unions and simple pattern matching
* Stateless lexing
* Hiding side effects in IO operations
* Error handling using the option type
* Defining mutually and self dependent types
* Building functions at run-time by combining basic functions

The content of this tutorial is significantly denser than usual. The really interesting part in my opinion resides however in the design decision to separate parsing in two stages. During the first stage, second-stage parsers are dynamically created by combining simple parsers (such as a semi-colon parser, an int parser, even a "zero" parser parsing nothing). Second-stage parsing simply consists of calling the result of first-stage parsing, which results in hierarchical data representing faces, lights and materials composing a 3d scene.

Sunday, November 9, 2008

Situation update

Tutorials
There has been little visible activity on the tutorials recently. That's because I have been busy implementing a small 3d object viewer. I have the somewhat unfortunate tendency to choose complexity over simplicity, which means that this viewer should be cross-platform. It is based on a generic scene-graph library and will support both DirectX (using SlimDX) and OpenGL (using the TAO framework).
It can display 3d objects stored in .X files, currently using the DirectX API. I have started a mini-project to write my own parser for it, as DirectX is limited to windows platforms.
I am starting to see the end of the tunnel, and the following tutorials should come soon:

- A generic scene graph library. It will demonstrate the use of some of F# objected-oriented constructs. It will also show how to use SlimDX with F#, which you can already see in snk_kid's examples.

- A parser for .X files. Writing this code proved more challenging than expected: .X files start by defining the data structures that are used later in the file to describe the data. See e.g. Direct-X File Format. This is a flexible but unusual approach: Data structures are normally fixed from file to file, and parsers can be written with this knowledge at hand. I found an elegant solution using a functional approach: The data structure declarations are parsed, generating new data parsers, which are used later when parsing the data itself.

XNA
I am staying clear of XNA for the time being. It would seem that my dream of getting my racing game to run on the XBox360 is not realizable, as the CLR on the XBox360 is notably slow when it comes to computations on floats, which are hardly avoidable in physics simulations.

Mono and games
Regarding performance and games, it seems Mono is overtaking Microsoft's CLR. Miguel de Icaza has announced that Mono now has experimental support for SIMD instructions. My racing game is currently implemented in C++, but does not take advantage of SIMD instructions. I am hopeful that a new clean implementation of my physics engine in F# may actually be faster than the current C++ one.

Friday, October 24, 2008

Tutorials moving to Google code

From now on, I will use the wiki on the Google code project for the tutorials. This has the advantage that readers will be able to see the history of corrections. Another good thing is that the wiki supports syntax highlighting. Blogger's software does not seem especially fit for posting code.

I think I will continue to use this blog for my thoughts and opinions about game programming and F#.

Sunday, October 19, 2008

Tutorial 11b: Efficient generic programming

Specialization is a programming technique used to improve the performance of generic code. It consists of specifying specific implementations for one or several specific instantiations of generic code.

Specialization using pattern matching

The "runge_kutta_4" function shown in these tutorials contains the following lines:

old + (1.0/6.0) * (v1 + 2.0 * (v2 + v3) + v4)


This is fine when dealing with floats, but with 3-dimensional vectors and larger types, it may be excessively slow due to the large amounts of intermediate values produced. Indeed, this code is the same as:

tmp1 = v2 + v3
tmp2 = 2.0 * tmp1
tmp3 = tmp2 + v4
tmp5 = v1 + tmp3
tmp6 = 1.0 / 6.0
tmp7 = tmp6 * tmp5
old + tmp7


Of these 7 intermediate values, 6 could be replaced by one.

What if we had a function to add 6 vectors creating only one value, the final result?

let add6 ((v0: Vec3d), (v1: Vec3d), (v2: Vec3d), (v3: Vec3d), (v4: Vec3d), (v5: Vec3d)) =
let x = v0.x + v1.x + v2.x + v3.x + v4.x + v5.x
let y = v0.y + v1.y + v2.y + v3.y + v4.y + v5.y
let z = v0.z + v1.z + v2.z + v3.z + v4.z + v5.z

Vec3d(x, y ,z)


Admittedly, it's not very pretty, and it would be nice if the user of "runge_kutta_4" was not forced to provide such a function.

F# has a parametric type called "option" useful to represent this kind of optional values. The C++ counter-part is to use a pointer, and let the null value denote the case when no value is provided.

The dictionary of operations is modified as follows:

type VecOps<'vector, 'scalar> = {
up: 'vector
add: 'vector * 'vector -> 'vector
add6: Add6<'vector> option
scale: 'scalar * 'vector -> 'vector
}


(By the way, another equivalent way of writing the declaration of add6 would be
add6: 'vector Add6 option
)

An instance of the dictionary using add6 is created as follows:

let Vec3d_ops = {
add = plus_Vec3d;
scale = scale_Vec3d;
add6 = Some add6;
up = Vec3d(1.0)
}


Or, to leave out add6:

let Vec3d_ops = {
add = plus_Vec3d;
scale = scale_Vec3d;
add6 = None;
up = Vec3d(1.0)
}


The last part of the "runge_kutta_4" function is modified to check if the dictionary of operations contains an "add6":

match vec_ops.add6 with
// No addMany available, use pair-wise additions (will create many intermediate vectors).
| None ->
old + (1.0/6.0) * (v1 + 2.0 * (v2 + v3) + v4)
// addMany available, use it.
| Some add ->
let tmp = add (v1, v2, v2, v3, v3, v4)
old + (1.0/6.0) * tmp


This is a new F# construct, called "pattern-matching". Basically, it means:

"Check the value of vec_ops.add6. If it is equal to None, then do the sum the usual way. otherwise, if the value of vec_ops.add6 is of the form Some add, then do the some by using 'add'".

I think pattern matching is a very nice construct. It allows you to match a value with fairly complex expressions called patterns. I won't dig deeper into patterns yet, let me just say that even though they look complex, the code generated by the compiler is as optimized as possible.

Optimization troubles

Unfortunately, not everything runs as fine as I had hoped. If I look at the generated code, I see that the optimizations that the F# compiler performed in the first part of Tutorial 11 have now disappeared. The problem seems to be that the compiler no longer sees through the record used for the dictionary of operations, and is therefore unable to inline the addition and scale operations. It is also unable to see if vec_ops.add6 is constantly "None" or if it has a constant function value.

As it turns out, the problem has nothing to do with pattern matching. Indeed, if I removed the "up" field from the record, or if use multiple parameters instead of a record, the generated code looks very nice, as shown below.

When an "add6" function is provided:



When no "add6" function is provided:




Was it worth it?
I get a 25% performance boost thanks to "add6" when using 3d vectors. On the other hand, I get a 65% performance drop when summing floats. This seems to be a clear case where tailoring generic code to specific types really helps to achieve the best performance.
There is a link to the full code available at the bottom of this page, if you want to try it out for yourself. I have made an habit of messing up my benchmark measurements, so don't take my word!

Conclusion

Time to conclude the part of these Tutorials dedicated to generic programming and low-level optimization.

To summarize:
1) Make sure your computations on floats are correct. NaN values kill performance.
2) Avoid curried functions when performance is critical, especially for methods.
3) Inline generic functions to avoid indirect calls to operations. However, do not take the habit of inlining all small functions. The compiler or the CLR do that for you.
4) Keep dictionaries of operations small, or avoid them altogether. Alternatives include static type constraints, shown in Tutorial 9, or function parameters.

Regarding item 4), I hope this is a problem that will be fixed in later releases of the F# compiler.

Source code
The complete source code used in this tutorial is available on google code (with comments, for a change).

Wednesday, October 15, 2008

Tutorial 11: Efficient generic programming

In Tutorial 8 I introduced a method for generic programmings which uses so-called dictionaries of operations (see also pp 110-111 in Expert F#). I concluded the article with the following judgment:

I'll close this up with a comment on performance. To put things bluntly, it's terrible. I am not quite sure if I have done something wrong, so I won't post scary numbers (yet), but I would say this method is not usable for real time physics simulations.


Bug fix

Well, it turned out that (1) I did do something wrong, and (2) there was a lot of untapped optimization potential in this code.

Before I go deeper into optimization techniques, let me correct a bug in the code:

let spring (vec_ops: VecOps<'a, 'b>) stiff pos =
vec_ops.scale stiff pos


should have been

let spring (vec_ops: VecOps<'a, 'b>) stiff pos =
vec_ops.scale -stiff pos


Without the minus sign, the simulated system quickly gains so much energy that the positions fall out of range. Apparently, floating point computations involving NaNs are very slow.

Faster, but still slow

With that problem fixed, the program now runs with the following execution times:

1D (struct)
Single (min, max, avg): 17.296000 17.382000 17.330000
Multi (min, max, avg): 8.244000 8.591000 8.412000
Single (min, max, avg): 78.617000 80.769000 79.340333
Multi (min, max, avg): 41.145000 41.741000 41.362333
2D (struct)
Single (min, max, avg): 18.918000 19.332000 19.158667
Multi (min, max, avg): 10.107000 10.411000 10.300667
Single (min, max, avg): 108.103000 110.466000 109.122333
Multi (min, max, avg): 56.108000 58.272000 57.140667
3D (struct)
Single (min, max, avg): 23.440000 23.683000 23.553667
Multi (min, max, avg): 12.671000 15.129000 13.726667
Single (min, max, avg): 112.160000 114.022000 113.266333
Multi (min, max, avg): 60.331000 65.190000 63.192667


Note that integration using 3-dimensional vectors is only ~1.5 times slower than when using 1-dimensional vectors, which is unexpected. The computations alone should really be 3 times slower. Where does the difference come from?

The time for the 1-dimensional computation is roughly fp + ovh, where fp is the time spend doing additions and multiplications on floats, and ovh is the overhead, i.e. time spent in calling functions and allocating objects.
Then the time for the 3-dimensional computation should be 3fp + ovh. For the sake of simplicity I'm assuming the overhead does not depend on the type of vector involved.
I've done the math for single-threaded RK4, and it turns out that the overhead time is about 78% of the total execution time.

It should be possible to do better, especially if we look at the performance of the code in Tutorial 9. Personally, I prefer the code in Tutorial 8, and would be glad if I could bring down the execution time to comparable levels.

Optimizations

A common (and valuable) advice on optimization techniques is to profile the code, using e.g. NProf. In the present case, however, NProf did not tell me anything I did not know.
At this point, .NET reflector comes in handy as it allows one to look at the generated IL code.

Let us look at the code for "ef_Vec1d", the Euler integration function using 1-dimensional vectors. It's not trivial to find, there is a bit of detective work to do. In the code for "_main", we find:



ef_Vec1d is a partially evaluated function, which is compiled into an object mimicking a function. This kind of object holds the evaluated parameters as data members and has an "Invoke" method which is the function itself.
There is a bit of overhead associated to partially evaluated functions, which means one should avoid them in performance critical sections of the code.
If we look at "clo@64_1", we will see that it lacks any data member. Moreover, the "Invoke" method contains:



Notice how the compiler is accessing Vec1D_ops directly, instead of using a data member or a function parameter. Would it be possible to get the compiler to do the same kind of optimization inside "euler_implicit"? In order for this to be possible, "euler_implicit" must be declared "inline". The new "clo@64" becomes:



The generated code is starting to be hard to read, but we can see that the body of "euler_implicit" was indeed inlined, and is now using Vec1D_ops directly. That is a nice optimization, but it would be better if the compiler had gone further and called "Vec1d.scale" directly. It might have even inlined "Vec1d.scale", as this method is very short.
What can we do to help the compiler do that? I had to take a guess here, but I suspected using a record for VecOps instead of a class could be a good idea.



That was a pretty good intuition. For some reason, classes prevent the F# compiler to "see through" them and remove them during optimization. Records, on the other hand, can be removed.

This piece of code, which is much more easily readable than the previous one, exposes a new problem:



Note that "plus" is not a function of two Vec1d returning a Vec1d. What it really is is a function of a single Vec1d, returning a FastFunc, which is itself an object mimicking a function. Every time "plus" is called, a new function object is created, which is a bad thing, performance wise. Happily, the solution is easy: Change the signature of plus to take two Vec1d and return a Vec1d. Instead of Vec1d -> Vec1d -> Vec1d



we want (Vec1d * Vec1d) -> Vec1d:



After applying the same change to "scale", we get the following performance:

1D (struct)
Single (min, max, avg): 1.959000 1.971000 1.964667
Multi (min, max, avg): 0.990000 1.004000 0.995000
Single (min, max, avg): 9.463000 9.509000 9.479667
Multi (min, max, avg): 4.939000 5.023000 4.975667
2D (struct)
Single (min, max, avg): 3.485000 3.522000 3.507333
Multi (min, max, avg): 1.783000 1.814000 1.794667
Single (min, max, avg): 20.306000 20.692000 20.468333
Multi (min, max, avg): 10.557000 10.657000 10.593000
3D (struct)
Single (min, max, avg): 6.089000 6.127000 6.104667
Multi (min, max, avg): 3.073000 3.201000 3.152333
Single (min, max, avg): 32.570000 32.616000 32.597333
Multi (min, max, avg): 16.424000 16.784000 16.657667
3D (record)
Single (min, max, avg): 6.910000 6.958000 6.940667
Multi (min, max, avg): 3.530000 3.575000 3.554667
Single (min, max, avg): 33.879000 34.373000 34.065000
Multi (min, max, avg): 17.662000 17.843000 17.758667


That was quite a performance improvement, wasn't it?
I have also implemented a 3d vector type using a record instead of a struct, in case it might magically improve performance. As you can see, it did not, there is no real difference with the struct.

These numbers could certainly be improved further by using the techniques described in Tutorial 7, but I'll stop here for now.


Complete source code

For the complete source code of the optimized version, see revision 8

To compare the changes with the code from Tutorial 8, see the diff.

Friday, October 10, 2008

Source code on Google code

Just a quick note to announce that I have created a google code project to host the source code used in this blog.
I have not yet added all tutorials, but the seven first are already there.

Saturday, September 27, 2008

Tutorial 10: Generic programming using quotations

The code for this tutorial comes in several files.
First: GenericPhysics.fs

#light

open System
open Microsoft.FSharp.Linq.QuotationEvaluation

#nowarn "57"


type ps<'Vec> = class
val pos:'Vec
val speed:'Vec
new(pos, speed) = { pos = pos; speed = speed }
end


type sa<'Vec> = class
val speed:'Vec
val accel:'Vec
new(speed, accel) = { speed = speed; accel = accel }
end


let mk_gravity scale_func (up: 'Vec): 'Vec =
let q =
<@
let (*) = %scale_func in -9.81 * up
@>
q.Eval()


let mk_spring scale_func: float -> 'Vec -> 'Vec =
let q =
<@
let (*) = %scale_func in fun k v -> -k * v
@>
q.Eval()


let mk_drag scale_func: float -> 'Vec -> 'Vec =
let q =
<@
let (*) = %scale_func in fun k v -> -k * v
@>
q.Eval()


let mk_euler_implicit plus_func scale_func =
let q =
<@
let (+) = %plus_func in
let (*) = %scale_func in
fun accel_func pos speed delta ->
let speed2 = speed + delta * (accel_func pos speed)
let pos2 = pos + delta * speed2
ps<_>(pos2, speed2)
@>
q.Eval()


let inline adapt (f: 'Vec -> 'Vec -> 'Vec) =
fun pos speed ->
let accel = f pos speed in
sa<_>(speed, accel)


let mk_rk4_q plus_func scale_func =
<@
fun deriv_func pos speed delta ->
let pf = %plus_func
let sf = %scale_func

let extrapolate x delta dx =
pf x (sf delta dx)

let half_delta = 0.5 * delta

let sa1: sa<_> = deriv_func pos speed
let sa2 = deriv_func (extrapolate pos half_delta sa1.speed) (extrapolate speed half_delta sa1.accel)
let sa3 = deriv_func (extrapolate pos half_delta sa2.speed) (extrapolate speed half_delta sa2.accel)
let sa4 = deriv_func (extrapolate pos delta sa3.speed) (extrapolate speed delta sa3.accel)

let sixth_delta = delta / 6.0

let update old v1 v2 v3 v4 =
pf old (sf sixth_delta
(pf v1
(pf v2
(pf v2
(pf v3
(pf v3
v4))))))

let pos1 = update pos sa1.speed sa2.speed sa3.speed sa4.speed
let speed1 = update speed sa1.accel sa2.accel sa3.accel sa4.accel

ps<'Vec>(pos1, speed1)
@>


let mk_rk4 plus_func scale_func =
let q = mk_rk4_q plus_func scale_func
q.Eval()


let simulate intg_func (t0: float) t_fin delta pos0 speed0 =
let rec repeat t0 pos0 speed0 =
if t0 > t_fin then (pos0, speed0)
else
let t1 = t0 + delta
let ps: ps<_> = intg_func pos0 speed0 delta
repeat t1 ps.pos ps.speed
repeat t0 pos0 speed0


Second file: FloatPhysics.fs

#light

let gravity = GenericPhysics.mk_gravity <@ fun x y -> x * y @> 1.0
let drag = GenericPhysics.mk_drag <@ fun x y -> x * y @>
let spring = GenericPhysics.mk_spring <@ fun x y -> x * y @>
let rk4 = GenericPhysics.mk_rk4 <@ fun x y -> x + y @> <@ fun x y -> x * y @>
let euler = GenericPhysics.mk_euler_implicit <@ fun x (y:float) -> x + y @> <@ fun x y -> x * y @>


Third file: Util.fs

#light

open System

let run_time_func (func: unit -> 'a) =
let watch = new System.Diagnostics.Stopwatch()
watch.Start()
let res = func ()
watch.Stop()
let diff0 = float watch.ElapsedMilliseconds / 1000.0
(diff0, res)


let map_parallel func items =
let tasks =
seq {
for i in items -> async {
return (func i)
}
}
Async.Run (Async.Parallel tasks)


Last file: Main.fs
#light

open System
open FloatPhysics
open Util

let inline accel pos speed =
drag 1.0 speed + spring 100.0 pos + gravity

let ef = (FloatPhysics.euler accel)
let rk4f = (rk4 (GenericPhysics.adapt accel))

let run euler_func s0 =
let t0 = 0.0
let t_fin = 1000.0
let delta = 0.005

let run_timed map_func =
let n_runs = 2
let times =
[
for i in 1..n_runs ->
let (run_time, res) =
run_time_func (fun () ->
s0 |> map_func(fun (x, y) ->
GenericPhysics.simulate euler_func t0 t_fin delta x y))
run_time
]

let max = Seq.max times
let min = Seq.min times
let avg = (Seq.fold (fun x y -> x+y) 0.0 times) / float n_runs
(min, max, avg)

let single_min, single_max, single_avg = run_timed List.map
let multi_min, multi_max, multi_avg = run_timed map_parallel

printfn "Single (min, max, avg): %f %f %f" single_min single_max single_avg
printfn "Multi (min, max, avg): %f %f %f" multi_min multi_max multi_avg


let n_bodies = 2

let s0_float = [ for s in 1..n_bodies -> (float s, 0.0) ]
run ef s0_float
run rk4f s0_float

ignore(Console.ReadKey(true))



If you want to compile these sources using Visual Studio, you need to know that the order in which the files appear in the project is important. If a file B uses types or values from a file A, then A must be placed higher in the list of files of the project.

You will also need to add references to the following DLLs: FSharp.PowerPack and FSharp.PowerPack.Linq.

Let us now take a deeper look at the new features of F# used in the code.

open Microsoft.FSharp.Linq.QuotationEvaluation

#nowarn "57"

The main item of this Tutorial is quotations. These represent pieces of code which can be constructed at compile time, changed dynamically at run time, and recompiled and executed dynamically. This last part is still experimental, and requires the use of a specific module.
The "#nowarn" directives tells the compiler not to warn us about this feature being experimental.

let mk_gravity scale_func (up: 'Vec): 'Vec =
let q =
<@
let (*) = %scale_func in -9.81 * up
@>
q.Eval()

The interesting piece here is the part between <@ and @>. The entire thing is called a quotation. It must contain a valid segment of F# code. The F# compiler will parse the code in this quotation, and build an object representing this code. More precisely, the compiler will generate an abstract syntax tree (AST). Note that this object does not in itself constitute executable code.
You may be wondering what "%scale_func" means in this context. The "%" operator specifies that another AST should be inserted there. "scale_func" is the variable which should contain the AST to insert.
To clarify this, look at the call to mk_gravity in FloatPhysics.fs:
let gravity = GenericPhysics.mk_gravity <@ fun x y -> x * y @> 1.0

The resulting quotation would therefore look as follows:

<@
let (*) = fun x y -> x * y in -9.81 * up
@>

which is has the same effect as:

<@
-9.81 * up
@>

An AST generated from a quotation can be compiled using method "Compile", or compiled and executed using method "Eval".
Had I written "q.Eval()" instead, "mk_gravity" would have returned a value of type "unit -> 'Vec", i.e. a function taking no argument and returning a vector when called. We may just as well returned the vector directly, which we can do by using method "Eval" instead.

If we take a step back and look at what we have just achieved, we will see that we have constructed a rather complicated way of perform the product of a vector by a scalar.

Let us look at the next function for a more interesting example.
let mk_spring scale_func: float -> 'Vec -> 'Vec =
let q =
<@
let (*) = %scale_func in fun k v -> -k * v
@>
q.Eval()

While I wrote this code I repeatedly found myself wondering "What did I just write?". Happily, the compiler can answer that question.This function takes an AST which should represent a function taking a float and a vector (in curried form) and returning a vector. The "mk_spring" function will then return a function taking a float and a vector and returning a vector. In other words, it will "unquote" the quotation, which is simply done by evaluating the AST. Before doing so, it inserts the quoted form of the function performing the product of a vector by a scalar into q, which in the present case has the same effect as:
<@
let (*) = fun x y -> x * y in fun k v -> -k * v
@>

Or in simpler terms:
<@
-k * v
@>

We have found yet another way to perform the product of a vector (which is actually a float in this simple example) by a scalar (another float, but not to be mistaken with the vector). The "mk_spring" function differs from "mk_gravity" in the way that its parameters are expected to vary. Gravity is assumed to be constant, and so is the upward direction, denoted by the "up" parameter of "mk_gravity". The stiffness coefficient and the position of the body, on the other hand, change across calls.

Using quotations to achieve genericity can be seen as some kind of hybrid between the functional approach described in Tutorial 8 and the approach based on static constraints of Tutorial 9. It shares with Tutorial 8 its use of function parameters to specify the implementation of operations on vectors. It share with Tutorial 9 its way of emitting specific code for each concrete type of vector.

I had hoped this would result in more efficient code, but that is unfortunately not currently the case. The quotation evaluation functionality is still experimental, and it may be possible that future versions of F# might produce faster code.

Update
I've come across an entry in Jomo Fisher's blog about Linq which makes me wonder if my code is really right. Jomo has used Linq in a setting where performance is truly impressive. Why I am getting such bad results here? Further investigation needed...
End of Update

The point of the approach described here is not so much what it achieves, but what possibilities it opens. For instance, it is possible to implement aggressive optimization techniques that the F# compiler will probably never support, such as rearranging code to avoid creating objects for intermediate values, or turning pure computations into computations with local side effects. Obviously, it would not be realistic to support all features of F#, but a subset dedicated to computations would probably suffice. Using Quotations.Expr from F# and Reflection.Emit from .NET, it seems very feasible to write a specialized optimizing compiler for a subset of F#.
Other interesting potential uses of quotations for games include programs to run on the GPU (shaders), or on any other kind of peripheral, for that matter (game controllers, head-on displays...).

Wednesday, September 24, 2008

Updated C++ version

This is the C++ version of the integrator, updated to use vectors.

UPDATE: Bug fixes. Funny how non-explicit single-argument constructors can hide errors in your code...


// RK++NET.cpp : main project file.

#include "stdafx.h"
#include <windows.h>

#include <map>
#include <vector>
#include <iostream>

const double gravity = -9.81;


template<typename T>
inline
T spring(const T& pos)
{
return -100.0 * pos;
}


template<typename T>
inline
T drag(double k, const T& pos)
{
return -k * pos;
}


template<typename A>
class EulerImplicit
{
public:
EulerImplicit(const A& accel):
accel(accel)
{
}

template<typename T>
inline
void operator()(T& pos, T& speed, const double delta) const
{
speed += delta * accel(pos, speed);
pos += delta * speed;
}

private:
const A& accel;
};


template<typename A>
class RK4
{
public:
RK4(const A& accel):
accel(accel)
{
}

template<typename T>
inline
void operator()(T& pos, T& speed, const double delta) const
{
const double half_delta = 0.5 * delta;

std::pair<T, T> k1 = accel(pos, speed);
std::pair<T, T> k2 = accel(pos + half_delta * k1.first, speed + half_delta * k1.second);
std::pair<T, T> k3 = accel(pos + half_delta * k2.first, speed + half_delta * k2.second);
std::pair<T, T> k4 = accel(pos + delta * k3.first, speed + delta * k3.second);

pos += delta / 6.0 * (k1.first + 2.0 * (k2.first + k3.first) + k4.first);
speed += delta / 6.0 * (k1.second + 2.0 * (k2.second + k3.second) + k4.second);
}

private:
const A& accel;
};


template<typename T>
class Force
{
public:
explicit Force(const T& up) : m_drag(1.0), m_gravity(gravity*up)
{
}

T operator()(const T& pos, const T& speed) const
{
return drag(m_drag, speed) + spring(pos) + m_gravity;
}

private:

const double m_drag;
T m_gravity;
};


template<typename A>
class Adapt
{
public:
Adapt(const A& accel):
accel(accel)
{
}

template<typename T>
inline std::pair<T, T> operator()(const T& pos, const T& speed) const
{
std::pair<T, T> ret;
ret.first = speed;
ret.second = accel(pos, speed);
return ret;
}

private:
const A& accel;
};


template<typename T, typename I>
inline
void simulate(const I& intg_func, T& pos, T& speed, double t0, const double t_fin, const double delta)
{
while(t0 < t_fin)
{
t0 += delta;
intg_func(pos, speed, delta);
}
}


template<int DIM>
class VecD
{
public:
VecD() {
for (int i=0; i<DIM; ++i)
vals[i] = 0.0;
}

explicit VecD(double v0, double v1=0.0, double v2=0.0)
{
if (DIM >= 1)
vals[0] = v0;

if (DIM >= 2)
vals[1] = v1;

if (DIM >= 3)
vals[2] = v2;

for (int i=3; i<DIM; ++i)
vals[i] = 0.0;
}

inline VecD<DIM> operator+=(const VecD<DIM>& other)
{
for (int i=0; i<DIM; ++i)
vals[i] += other.vals[i];
return *this;
}

inline VecD<DIM> operator+(const VecD<DIM>& other) const
{
VecD<DIM> tmp(*this);
return tmp += other;
}

public:
double vals[DIM];
};


template<int DIM>
inline
VecD<DIM>
operator*(double k, const VecD<DIM>& other)
{
VecD<DIM> ret(other);
for (int i=0; i<DIM; ++i)
ret.vals[i] *= k;
return ret;
}


typedef VecD<1> Vec1D;
typedef VecD<2> Vec2D;
typedef VecD<3> Vec3D;



void make_pos(int i, double& out) { out = (double)i; }
void make_pos(int i, Vec1D& out) { out = Vec1D((double) i); }
void make_pos(int i, Vec2D& out) { out = Vec2D(1.0, (double) i); }
void make_pos(int i, Vec3D& out) { out = Vec3D(1.0, (double) i, -1.0); }

void make_up(double &out) { out = 1.0; }
void make_up(Vec1D &out) { out = Vec1D(1.0); }
void make_up(Vec2D &out) { out = Vec2D(0.0, 1.0); }
void make_up(Vec3D &out) { out = Vec3D(0.0, 1.0, 0.0); }

template<typename Vec>
void run_sim(const char *descr)
{
typedef std::vector< std::pair<Vec, Vec> > Tuples;

std::cout << descr << std::endl;

const double t0 = 0.0;
const double t_fin = 1000.0;
const double delta = 0.005;
const int N_BODIES = 100;

Tuples initial_states;
initial_states.reserve(N_BODIES);
for (int i=0; i<N_BODIES; ++i)
{
Vec pos;
make_pos(i, pos);
initial_states.push_back( std::make_pair(pos, Vec(0.0)) );
}

double el = 0.0;
const int NRUNS=3;
for (int i=0; i<NRUNS; ++i)
{
Tuples states(initial_states);

Vec up;
make_up(up);
Force<Vec> accel(up);
EulerImplicit< Force<Vec> > euler(accel);
long int start0 = GetTickCount();
for (Tuples::iterator it = states.begin();
it != states.end();
++it)
{
simulate(euler, it->first, it->second, t0, t_fin, delta);
}
el += (GetTickCount() - start0) / 1000.0;
}
std::cout << "Time: " << el / NRUNS << "\n";
}

int main(int argc, char **argv)
{
run_sim<double>("double");
run_sim<Vec1D>("class 1D");
run_sim<Vec2D>("class 2D");
run_sim<Vec3D>("class 3D");

char c;
std::cin >> c;
return 0;
}

Tutorial 9: Generic programming using type constraints

This Tutorial 9 shows a generic implementation of the integrator, based on type parameters.

#light

open System


let run_time_func (func: unit -> 'a) =
let watch = new System.Diagnostics.Stopwatch()
watch.Start()
let res = func ()
watch.Stop()
let diff0 = float watch.ElapsedMilliseconds / 1000.0
(diff0, res)


let map_parallel func items =
let tasks =
seq {
for i in items -> async {
return (func i)
}
}
Async.Run (Async.Parallel tasks)

type ps<'num> = struct
val pos:'num
val speed:'num
new(pos, speed) = { pos = pos; speed = speed }
end

type VecN =
One of float
| Two of float*float
| Three of float*float*float
with
static member inline plus (a: VecN) (b: VecN) =
match (a, b) with
| (One(v), One(w)) -> One(v+w)
| (Two(v0, v1), Two(w0, w1)) -> Two(v0 + w0, v1 + w1)
| (Three(v0, v1, v2), Three(w0, w1, w2)) -> Three(v0 + w0, v1 + w1, v2 + w2)
| _ -> failwith "Size mismatch"

static member inline scale (k: float) (a: VecN) =
match a with
| One(v) -> One(k * v)
| Two(v0, v1) -> Two(k * v0, k * v1)
| Three(v0, v1, v2) -> Three(k * v0, k * v1, k * v2)


type Vec1d = struct
val x: float

new (x) = { x = x }

static member inline plus (a: Vec1d) (b: Vec1d) = Vec1d(a.x + b.x)
static member inline scale (k: float) (a: Vec1d) = Vec1d(k * a.x)
end


type Vec2d = struct
val x: float
val y: float

new (x, y) = { x = x; y = y }

static member inline plus (a: Vec2d) (b: Vec2d) = Vec2d(a.x + b.x, a.y + b.y)
static member inline scale (k: float) (a: Vec2d) = Vec2d(k * a.x, k * a.y)
end


type Vec3d = struct
val x: float
val y: float
val z: float

new (x, y, z) = { x = x; y = y; z = z }

static member inline plus (a: Vec3d) (b: Vec3d) = Vec3d(a.x + b.x, a.y + b.y, a.z + b.z)
static member inline scale (k: float) (a: Vec3d) = Vec3d(k * a.x, k * a.y, k * a.z)
end


type CVec1d (x: float) = class
member v.X = x

static member inline plus (a: CVec1d) (b: CVec1d) = CVec1d(a.X + b.X)
static member inline scale (k: float) (a: CVec1d) = CVec1d(k * a.X)
end


type CVec2d (x: float, y:float) = class
member v.X = x
member v.Y = y

static member inline plus (a: CVec2d) (b: CVec2d) = CVec2d(a.X + b.X, a.Y + b.Y)
static member inline scale (k: float) (a: CVec2d) = CVec2d(k * a.X, k * a.Y)
end


type CVec3d (x: float, y: float, z: float) = class
member v.X = x
member v.Y = y
member v.Z = z

static member inline plus (a: CVec3d) (b: CVec3d) = CVec3d(a.X + b.X, a.Y + b.Y, a.Z + b.Z)
static member inline scale (k: float) (a: CVec3d) = CVec3d(k * a.X, k * a.Y, k * a.Z)
end


let gravity =
-9.81


let inline spring (pos: ^vec) =
let (*) k x = (^vec: (static member inline scale: float -> ^vec -> ^vec) k) x
let k = 100.0
-k * pos


let inline drag k (pos: ^speed) (speed: ^speed) =
let (*) k x = (^vec: (static member inline scale: float -> ^vec -> ^vec) k) x
-k * speed


let inline euler_implicit accel_func (pos: ^vec) (speed: ^vec) delta =
let (+) x y = (^vec: (static member inline plus: ^vec -> ^vec -> ^vec) x) y
let (*) k x = (^vec: (static member inline scale: float -> ^vec -> ^vec) k) x

let speed2 = speed + (delta * (accel_func pos speed))
let pos2 = pos + (delta * speed2)
ps<_>(pos2, speed2)


let inline simulate intg_func t0 t_fin delta pos0 speed0 =
let oc = OptimizedClosures.FastFunc3<_,_,_,_>.Adapt(intg_func)
let rec repeat t0 pos0 speed0 =
if t0 > t_fin then (pos0, speed0)
else
let t1 = t0 + delta
let ps: ps<_> = oc.Invoke(pos0, speed0, delta)
repeat t1 ps.pos ps.speed
repeat t0 pos0 speed0


let inline accel (up: ^vec) (pos: ^vec) (speed: ^vec) =
let (+) x y = (^vec: (static member inline plus: ^vec -> ^vec -> ^vec) x) y
let (*) k x = (^vec: (static member inline scale: float -> ^vec -> ^vec) k) x
(drag 1.0 pos speed) + (spring pos) + (gravity * up)


let run euler_func s0 =
let t0 = 0.0
let t_fin = 1000.0
let delta = 0.005

let run_timed map_func =
let n_runs = 3
let times =
[
for i in 1..n_runs ->
let (run_time, res) = run_time_func (fun () -> s0 |> map_func(fun (x, y) -> simulate euler_func t0 t_fin delta x y))
run_time
]

let max = Seq.max times
let min = Seq.min times
let avg = (Seq.fold (fun x y -> x+y) 0.0 times) / float n_runs
(min, max, avg)

let single_min, single_max, single_avg = run_timed List.map
let multi_min, multi_max, multi_avg = run_timed map_parallel

printfn "Single (min, max, avg): %f %f %f" single_min single_max single_avg
printfn "Multi (min, max, avg): %f %f %f" multi_min multi_max multi_avg


let n_bodies = 10

let s0_Vec1d = [ for s in 1..n_bodies -> (Vec1d(float s), Vec1d(0.0)) ]
let ef_Vec1d = euler_implicit (accel (Vec1d(1.0)))

let s0_Vec2d = [ for s in 1..n_bodies -> (Vec2d(float s, 1.0), Vec2d(0.0, 0.0)) ]
let ef_Vec2d = euler_implicit (accel (Vec2d(0.0, 1.0)))

let s0_Vec3d = [ for s in 1..n_bodies -> (Vec3d(float s, 1.0, -1.0), Vec3d(0.0, 0.0, 0.0)) ]
let ef_Vec3d = euler_implicit (accel (Vec3d(0.0, 1.0, 0.0)))

let s0_CVec1d = [ for s in 1..n_bodies -> (CVec1d(float s), CVec1d(0.0)) ]
let ef_CVec1d = euler_implicit (accel (CVec1d(1.0)))

let s0_CVec2d = [ for s in 1..n_bodies -> (CVec2d(float s, 1.0), CVec2d(0.0, 0.0)) ]
let ef_CVec2d = euler_implicit (accel (CVec2d(0.0, 1.0)))

let s0_CVec3d = [ for s in 1..n_bodies -> (CVec3d(float s, 1.0, -1.0), CVec3d(0.0, 0.0, 0.0)) ]
let ef_CVec3d = euler_implicit (accel (CVec3d(0.0, 1.0, 0.0)))

let s0_One = [ for s in 1..n_bodies -> (One(float s), One(0.0)) ]
let ef_One = euler_implicit (accel (One(1.0)))

let s0_Two = [ for s in 1..n_bodies -> (Two(float s, 1.0), Two(0.0, 0.0)) ]
let ef_Two = euler_implicit (accel (Two(0.0, 1.0)))

let s0_Three = [ for s in 1..n_bodies -> (Three(float s, 1.0, -1.0), Three(0.0, 0.0, 0.0)) ]
let ef_Three = euler_implicit (accel (Three(0.0, 1.0, 0.0)))

printfn "1D (struct)"
run ef_Vec1d s0_Vec1d

printfn "2D (struct)"
run ef_Vec2d s0_Vec2d

printfn "3D (struct)"
run ef_Vec3d s0_Vec3d

printfn "1D (class)"
run ef_CVec1d s0_CVec1d

printfn "2D (class)"
run ef_CVec2d s0_CVec2d

printfn "3D (class)"
run ef_CVec3d s0_CVec3d

printfn "1D (union)"
run ef_One s0_One

printfn "2D (union)"
run ef_Two s0_Two

printfn "3D (union)"
run ef_Three s0_Three

ignore (Console.ReadKey(true))


Here come the detailed explanations:
type VecN =
One of float
[...]
type CVec3d (x: float, y: float, z: float) = class
member v.X = x
member v.Y = y
member v.Z = z

static member inline plus (a: CVec3d) (b: CVec3d) = CVec3d(a.X + b.X, a.Y + b.Y, a.Z + b.Z)
static member inline scale (k: float) (a: CVec3d) = CVec3d(k * a.X, k * a.Y, k * a.Z)
end

These definitions show different ways to implement vector types. The first method, which uses a union, is a bit uncommon. It is inelegant because it does not take advantage of the type system to capture the fact that adding vectors of different dimensions is not allowed. It is also the slowest implementation of vectors of all.
Follow definitions of one, two and three dimensional vectors using structs, which we already saw in the previous tutorial. The last group is almost identical, except for the fact that classes are used. I won't go into the differences between structs and classes in this post, but it's interesting to see that structs are faster in this program. This is however not generally true, even when considering only the special case of 3d vectors.

The interesting part starts below.
let inline spring (pos: ^vec) =

This defines a generic function "spring" taking a parameter of type "^vec". Note that "vec" here is not an actual type, but a type variable. We have already used the notation " 'vec", which plays a similar role. The difference resides in who creates the concrete function. In the case of " 'vec", this is handled by the generics mechanics in the .NET runtime. In the present case, using "^vec", it is handled by the F# compiler.
(QUESTION: is that correct? I tried replacing "^vec" with "'vec" in this line, and that worked too.)
This requires the function to be declared as inline. The compiler will generate a new copy of the code for each concrete type with which it is used.
For further details about type constraints, see Statically typed duck typing in F#.

let (*) k x = (^vec: (static member scale: float -> ^vec -> ^vec) k) x

I will start with the part of the line after the "=" symbol. It constitutes a type constraint, and means that "^vec" must have a static member named "scale", accepting a float and returning a function from "^vec" to "^vec" (in other words: a function taking a float and a "^vec" and returning a "^vec").
The rest of the line defines a function "*", thus overloading the multiplication operator, which allows us to write expressions in a natural way. Note that this method could also have been used in Tutorial 8, its availability is not subject to the use of type constraints.

let k = 100.0
-k * pos

Finally something we recognize...

You may have noticed that unlike Tutorial 8, this tutorial includes the optimizations from Tutorial 7 and 7b (aggressive inlining and OptimizedClosures). They do seem to make a difference here.

Update:
This Tutorial used to include a section comparing the performance of the F# and the C++ versions of Euler using doubles and structs (for 1-, 2- and 3-dimensional vectors). I noticed the timing results depended very much on whether the programs were started from Visual Studio or directly from Windows. C++ programs using .NET are faster when started from Visual studio, whereas F# programs run faster when started outside of Visual Studio. The differences were large enough to render comparisons meaningless, and I have therefore decided to remove the benchmark results until I have cleared that up.

I have kept the most original method to write generic code for the next Tutorial. Stay tuned!

Monday, September 22, 2008

Tutorial 8: Generic programming, functional approach

I have decided to continue this series by extending the integrator to handle 2d and 3d. One of the annoyances an amateur game developer encounters is the multiplication of 3d vector classes. They all use the same data layout, all offer the same methods, all offer the same performance, yet there is at least one such class per library. For instance, in my go-kart racing game, I have to deal with 3d vectors in OpenAL, 3d vectors in CGAL (which I used for DeLaunay triangulation), and should I start using ODE (a physics library), I'll have to deal with yet another type of 3d vectors. To make things worse, I am myself guilty of having implemented my own 3d vector math library.

The type inference feature in F# makes it easy to avoid introducing yet another 3d vector class. When the obligation to explicitly annotate all your variables with types disappears, writing polymorphic code becomes easy.

What approaches do we have to write generic code?
- the object-oriented approach based on interfaces, classes and method overriding,
- what I would call the functional approach,
- the template approach (in C++ terminology) also known as the generics approach (Java, .Net and friends terminology), and
- an approach which I have never seen yet (in compiled languages), which I will call the meta-programming approach.

I will skip over the first method, mostly because it has already abundantly been described and explained.
The second method is the subject of this Tutorial, and is also described in chapter 5 of Expert F#
The third method will be explored in the next Tutorial, and I am keeping the fourth method, the most exciting, for the last of the tutorials dedicated to generic programming.
Normally I would have started by showing the code, but in this case I thought I needed to justify why I should change the code at all. Especially if you come from C++, you might think that running the integration in 2d or 3d would just be a matter of changing the kind of vectors passed to the "simulate" and "accel" functions. As long as the new vector types support "operator+" and "operator*" we should be fine.
You would be wrong. Consider:
let x = 0.5 * a

I am not sure of the reasons behind this design decision, but in F#, this means that "a" and "x" are inferred to be the same type as 0.5, which is float. Moreover, the multiplication operator is now constrained to be applicable only on floats. That's right, you are no longer allowed to write "2 * 3". Like it or not, that's the way it is.

In practice, this means that I am probably best off avoiding operators + and *. Instead, the caller of my functions will have to tell them what functions to use to perform arithmetic operations.

OK, enough talk, here comes the code.

#light

open System


let run_time_func (func: unit -> 'a) =
let watch = new System.Diagnostics.Stopwatch()
watch.Start()
let res = func ()
watch.Stop()
let diff0 = float watch.ElapsedMilliseconds / 1000.0
(diff0, res)


let map_parallel func items =
let tasks =
seq {
for i in items -> async {
return (func i)
}
}
Async.Run (Async.Parallel tasks)


type Vec1d = struct
val x: float

new (x) = { x = x }

static member plus (a: Vec1d) (b: Vec1d) = Vec1d(a.x + b.x)
static member scale (k: float) (a: Vec1d) = Vec1d(k * a.x)
end


type Vec2d = struct
val x: float
val y: float

new (x, y) = { x = x; y = y }

static member plus (a: Vec2d) (b: Vec2d) = Vec2d(a.x + b.x, a.y + b.y)
static member scale (k: float) (a: Vec2d) = Vec2d(k * a.x, k * a.y)
end


type Vec3d = struct
val x: float
val y: float
val z: float

new (x, y, z) = { x = x; y = y; z = z }

static member plus (a: Vec3d) (b: Vec3d) = Vec3d(a.x + b.x, a.y + b.y, a.z + b.z)
static member scale (k: float) (a: Vec3d) = Vec3d(k * a.x, k * a.y, k * a.z)
end


type VecOps<'vector, 'scalar> = class
val up: 'vector
val add: 'vector -> 'vector -> 'vector
val scale: 'scalar -> 'vector -> 'vector

new (up, add, scale) = { up = up; add = add; scale = scale; }
end


let gravity (vec_ops: VecOps<'a, 'b>) =
vec_ops.scale -9.81 vec_ops.up


let spring (vec_ops: VecOps<'a, 'b>) stiff pos =
vec_ops.scale stiff pos


let drag (vec_ops: VecOps<'a, 'b>) k speed =
vec_ops.scale (-k) speed


let euler_implicit (vec_ops: VecOps<'a, 'b>) accel_func pos speed delta =
let speed2 = vec_ops.add speed (vec_ops.scale delta (accel_func pos speed))
vec_ops.add pos (vec_ops.scale delta speed2), speed2


let adapt (f: 'num -> 'num -> 'num) pos speed =
let accel = f pos speed
speed, accel


let runge_kutta_4 (vec_ops: VecOps<'a, 'b>) deriv_func pos speed delta =
let extrapolate x delta dx =
vec_ops.add x (vec_ops.scale delta dx)

let half_delta = 0.5 * delta

let s1, a1 = deriv_func pos speed
let s2, a2 = deriv_func (extrapolate pos half_delta s1) (extrapolate speed half_delta a1)
let s3, a3 = deriv_func (extrapolate pos half_delta s2) (extrapolate speed half_delta a2)
let s4, a4 = deriv_func (extrapolate pos delta s3) (extrapolate speed delta a3)

let sixth_delta = delta / 6.0

let update old v1 v2 v3 v4 =
vec_ops.add old
(vec_ops.scale sixth_delta
(vec_ops.add v1
(vec_ops.add v2
(vec_ops.add v2
(vec_ops.add v3
(vec_ops.add v3
v4))))))

let pos1 = update pos s1 s2 s3 s4
let speed1 = update speed a1 a2 a3 a4
pos1, speed1


let simulate intg_func t0 t_fin delta pos0 speed0 =
let rec repeat t0 pos0 speed0 =
if t0 > t_fin then (pos0, speed0)
else
let t1 = t0 + delta
let pos1, speed1 = intg_func pos0 speed0 delta
repeat t1 pos1 speed1
repeat t0 pos0 speed0


let run euler_func s0 =
let t0 = 0.0
let t_fin = 1000.0
let delta = 0.005

let run_timed map_func =
let n_runs = 1
let times =
[
for i in 1..n_runs ->
let (run_time, res) = run_time_func (fun () -> s0 |> map_func(fun (x, y) -> simulate euler_func t0 t_fin delta x y))
run_time
]

let max = Seq.max times
let min = Seq.min times
let avg = (Seq.fold (fun x y -> x+y) 0.0 times) / float n_runs
(min, max, avg)

let single_min, single_max, single_avg = run_timed List.map
let multi_min, multi_max, multi_avg = run_timed map_parallel

printfn "Single (min, max, avg): %f %f %f" single_min single_max single_avg
printfn "Multi (min, max, avg): %f %f %f" multi_min multi_max multi_avg


let Vec1D_ops = VecOps<Vec1d, float>(Vec1d(1.0), Vec1d.plus, Vec1d.scale)
let Vec2D_ops = VecOps<Vec2d, float>(Vec2d(0.0, 1.0), Vec2d.plus, Vec2d.scale)
let Vec3D_ops = VecOps<Vec3d, float>(Vec3d(0.0, 1.0, 0.0), Vec3d.plus, Vec3d.scale)

let accel (vec_ops: VecOps<'a, 'b>) (pos: 'a) (speed: 'a) =
vec_ops.add (spring vec_ops 100.0 pos)
(vec_ops.add (drag vec_ops 1.0 speed) (gravity vec_ops))

let n_bodies = 10

let s0_Vec1d = [ for s in 1..n_bodies -> (Vec1d(float s), Vec1d(0.0)) ]
let ef_Vec1d = euler_implicit Vec1D_ops (accel Vec1D_ops)
let rk4_Vec1d = runge_kutta_4 Vec1D_ops (adapt (accel Vec1D_ops))

let s0_Vec2d = [ for s in 1..n_bodies -> (Vec2d(float s, 1.0), Vec2d(0.0, 0.0)) ]
let ef_Vec2d = euler_implicit Vec2D_ops (accel Vec2D_ops)
let rk4_Vec2d = runge_kutta_4 Vec2D_ops (adapt (accel Vec2D_ops))

let s0_Vec3d = [ for s in 1..n_bodies -> (Vec3d(float s, 1.0, -1.0), Vec3d(0.0, 0.0, 0.0)) ]
let ef_Vec3d = euler_implicit Vec3D_ops (accel Vec3D_ops)
let rk4_Vec3d = runge_kutta_4 Vec3D_ops (adapt (accel Vec3D_ops))

printfn "1D (struct)"
run ef_Vec1d s0_Vec1d
run rk4_Vec1d s0_Vec1d

printfn "2D (struct)"
run ef_Vec2d s0_Vec2d
run rk4_Vec2d s0_Vec2d

printfn "3D (struct)"
run ef_Vec3d s0_Vec3d
run rk4_Vec3d s0_Vec3d

ignore(Console.ReadKey(true))


The code was somewhat cleaned up from previous tutorials, but no new concepts were added. I will never the less describe in further details my usage of structs.

type Vec2d = struct
val x: float
val y: float

new (x, y) = { x = x; y = y }

static member plus (a: Vec2d) (b: Vec2d) = Vec2d(a.x + b.x, a.y + b.y)
static member scale (k: float) (a: Vec2d) = Vec2d(k * a.x, k * a.y)
end

This defines a two-dimensional vector type supporting addition through the static method plus, and scalar product by a float through the static method scale. Nothing surprising there.

The next struct is more interesting:
type VecOps<'vector, 'scalar> = class
val up: 'vector
val add: 'vector -> 'vector -> 'vector
val scale: 'scalar -> 'vector -> 'vector

new (up, add, scale) = { up = up; add = add; scale = scale; }
end

Passing the addition and product methods to each function would be tedious, so I have defined a class to hold them. The first line defines a type called VecOps which is parametrized by two types, denoting respectively the vector type and the scalar type.
The second line is used to specify what is the upward direction. It's used only by function "gravity". A real integration library would normally not contain functions computing forces, since these are provided by the users. I know it's a bit clumsy to have this data member there.
The third and second lines define the type of the addition and scalar product operations.

It was a bit of a pain to add the extra parameter to each function manipulating vectors, but it's nothing unmanageable. This could have been averted by grouping all these functions into a class and passing the VecOps object to the constructor.
To avoid repeatedly passing VecOps objects to functions in the physics library, users can introduce short-cuts using function closures:
let ef_Vec1d = euler_implicit Vec1D_ops (accel Vec1D_ops)
let rk4_Vec1d = runge_kutta_4 Vec1D_ops (adapt (accel Vec1D_ops))


I'll close this up with a comment on performance. To put things bluntly, it's terrible. I am not quite sure if I have done something wrong, so I won't post scary numbers (yet), but I would say this method is not usable for real time physics simulations. I have un-inlined all functions, because (i) it makes little difference with respect to execution times, and (ii) the decisions to inline or not is best left to the .NET JIT.

There are however legitimate uses of the inline keyword, which will be shown in the next Tutorial.

Friday, September 19, 2008

Tutorial 7b: Further optimizations

The code for this tutorial shows the optimized version of Tutorial 7 by Jomo Fisher.
I am merely copy-pasting his changes and comments here, for the sake of completeness of this blog.

#light

open System

let run_time_func (func: unit -> 'a) =
let time0 = System.DateTime.Now
let res = func ()
let diff0 = System.DateTime.Now - time0
(diff0, res)

let map_parallel func items =
let tasks =
seq {
for i in items -> async {
return (func i)
}
}
Async.Run (Async.Parallel tasks)

type ps<'num> = struct
val pos:'num
val speed:'num
new(pos, speed) = { pos = pos; speed = speed }
end

type sa<'num> = struct
val speed:'num
val accel:'num
new(speed, accel) = { speed = speed; accel = accel }
end

let gravity =
-9.81

let inline spring pos =
let k = 100.0 in
-k * pos

let inline drag k pos speed = -k * speed

let inline euler_implicit accel_func pos speed delta =
let speed2 = speed + delta * accel_func pos speed in
ps<_>(pos + delta * speed2, speed2)

let inline adapt (f: 'num -> 'num -> 'num) =
fun pos speed ->
let accel = f pos speed in
sa<_>(speed, accel)

let inline runge_kutta_4 deriv_func pos speed delta =
let half_delta = 0.5 * delta
let sixth_delta = delta / 6.0
let sa1: sa<_> = deriv_func pos speed
let sa2 = deriv_func (pos + half_delta * sa1.speed) (speed + half_delta * sa1.accel)
let sa3 = deriv_func (pos + half_delta * sa2.speed) (speed + half_delta * sa2.accel)
let sa4 = deriv_func (pos + delta * sa3.speed) (speed + delta * sa3.accel)
let pos1 = pos + sixth_delta*(sa1.speed + 2.0 * sa2.speed + 2.0 * sa3.speed + sa4.speed)
let speed1 = speed + sixth_delta*(sa1.accel + 2.0 * sa2.accel + 2.0 * sa3.accel + sa4.accel) in
ps<_>(pos1, speed1)

let inline simulate intg_func t0 t_fin delta pos0 speed0 =
let oc = OptimizedClosures.FastFunc3<_,_,_,_>.Adapt(intg_func)
let rec repeat t0 pos0 speed0 =
if t0 > t_fin then (pos0, speed0)
else
let t1 = t0 + delta
let ps: ps<_> = oc.Invoke(pos0, speed0, delta)
repeat t1 ps.pos ps.speed
repeat t0 pos0 speed0


let initial_states = [ for s in 1..1000 -> (float s * 1.0, 0.0) ]
let t0 = 0.0
let t_fin = 1000.0
let delta = 0.025

let accel = fun pos speed -> (drag 1.0) pos speed + spring pos + gravity

let ef = (euler_implicit accel)
let rk4f = (runge_kutta_4 (adapt accel))

for i in 0..3 do
let (run_time, res) =
run_time_func (fun () ->
initial_states |> List.map(fun (x,y) ->
simulate ef t0 t_fin (0.25*delta) x y))
printfn "Euler single: %f" run_time.TotalSeconds;

let (run_time, res) =
run_time_func (fun () ->
initial_states |> map_parallel(fun (x,y) ->
simulate ef t0 t_fin (0.25*delta) x y))
printfn "Euler multi: %f" run_time.TotalSeconds;

let (run_time, res) =
run_time_func (fun () ->
initial_states |> List.map(fun (x,y) ->
simulate rk4f t0 t_fin delta x y))
printfn "RK4 single: %f" run_time.TotalSeconds;

let (run_time, res) =
run_time_func (fun () ->
initial_states |>
map_parallel(fun (x,y) ->
simulate rk4f t0 t_fin delta x y))
printfn "RK4 multi: %f" run_time.TotalSeconds;
done;

ignore(Console.ReadKey(true))


The "simulate" function now uses an optimized closure. Citing Jomo:
Closure optimization can be used when a function will be called many times. It allows F# to do a bit of expensive work up front that will later be amortized across the many calls to the function. Unfortunately, this is not something that you can stick everywhere and just get a benefit (otherwise, the compiler could just inject it for you). You should definitely profile before and after.



let inline simulate intg_func t0 t_fin delta pos0 speed0 =
let oc = OptimizedClosures.FastFunc3<_,_,_,_>.Adapt(intg_func)
let rec repeat t0 pos0 speed0 =
if t0 > t_fin then (pos0, speed0)
else
let t1 = t0 + delta
let ps: ps<_> = oc.Invoke(pos0, speed0, delta)
repeat t1 ps.pos ps.speed
repeat t0 pos0 speed0


My code made a rather extensive use of tuples, used for the returned values of "euler_implicit", "runge_kutta_4" and "adapt". Jomo has changed that to use structs instead. From what I understand, structs in F# are similar to structs in C#, and local variables of this type can be instantiated on the stack, instead of being created on the heap.
I'm still unsure about when to use structs and when not to, but I have noticed in my experiments that they do not always bring a performance boost. In other words, just replacing all your tuples (or other types I'll introduce later) is not always a good strategy.
The definition of the struct types is shown below.


type ps<'num> = struct
val pos:'num
val speed:'num
new(pos, speed) = { pos = pos; speed = speed }
end

type sa<'num> = struct
val speed:'num
val accel:'num
new(speed, accel) = { speed = speed; accel = accel }
end


This defines a struct "ps" parameterized by a type. "<'num>" declares the type parameter. Since the integrator in this tutorial is still limited to a single dimension, we could also have omitted the type parameter:


type ps = struct
val pos: float
val speed: float
new(pos, speed) = { pos = pos; speed = speed }
end


This defines a struct with two data members named pos and speed, and a constructor taking (oh surprise!) a position and a speed. The parts between curly braces means "create a value of type ps where the pos member is set to the parameter pos, and the speed member is set to the speed parameter".

Finally, Jomo also moved the creation of the integration functions out of the loop in the main program (which I obviously should have thought of...):

let ef = (euler_implicit accel)
let rk4f = (runge_kutta_4 (adapt accel))

let (run_time, res) =
run_time_func (fun () ->
initial_states |> List.map(fun (x,y) ->
simulate ef t0 t_fin (0.25*delta) x y))


The performance of the code has now increased to the numbers below.











ImplementationPerformance EulerPerformance RK4
Managed C++172413326530
F# Tutorial 6 66225111731
F# Tutorial 7 129032185185
JF single153846160000
JF multi297619318471



For the next tutorial, I am considering taking this integrator to the third dimension, or to draw a graph showing the trajectory of a particle. I have yet to decide.

Wednesday, September 17, 2008

C++ version of Tutorial 6

Below is the code of the C++ version of the Euler and RK4 integrators.


// RK++NET.cpp : main project file.

#include "stdafx.h"

using namespace System;
#include <map>
#include <vector>
#include <iostream>

const double gravity = -9.81;


template<typename T>
inline
T spring(const T& pos)
{
return -100.0 * pos;
}


template<typename T>
inline
T drag(double k, const T& pos)
{
return -k * pos;
}


template<typename A>
class EulerImplicit
{
public:
EulerImplicit(const A& accel):
accel(accel)
{
}

template<typename T>
inline
void operator()(T& pos, T& speed, const double delta) const
{
speed += delta * accel(pos, speed);
pos += delta * speed;
}

private:
const A& accel;
};


template<typename A>
class RK4
{
public:
RK4(const A& accel):
accel(accel)
{
}

template<typename T>
inline
void operator()(T& pos, T& speed, const double delta) const
{
const double half_delta = 0.5 * delta;

std::pair<T, T> k1 = accel(pos, speed);
std::pair<T, T> k2 = accel(pos + half_delta * k1.first, speed + half_delta * k1.second);
std::pair<T, T> k3 = accel(pos + half_delta * k2.first, speed + half_delta * k2.second);
std::pair<T, T> k4 = accel(pos + delta * k3.first, speed + delta * k3.second);

pos += delta / 6.0 * (k1.first + 2.0 * (k2.first + k3.first) + k4.first);
speed += delta / 6.0 * (k1.second + 2.0 * (k2.second + k3.second) + k4.second);
}

private:
const A& accel;
};


template<typename T>
class Force
{
public:
Force() : m_drag(1.0)
{}

T operator()(const T& pos, const T& speed) const
{
return drag(m_drag, speed) + spring(pos) + gravity;
}

private:
const double m_drag;
};


template<typename A>
class Adapt
{
public:
Adapt(const A& accel):
accel(accel)
{
}

template<typename T>
inline std::pair<T, T> operator()(const T& pos, const T& speed) const
{
std::pair<T, T> ret;
ret.first = speed;
ret.second = accel(pos, speed);
return ret;
}

private:
const A& accel;
};


template<typename T, typename I>
inline
void simulate(const I& intg_func, T& pos, T& speed, double t0, const double t_fin, const double delta)
{
while(t0 < t_fin)
{
t0 += delta;
intg_func(pos, speed, delta);
}
}


typedef std::vector< std::pair<double, double> > Tuples;


int main(array<System::String ^> ^args)
{
const double t0 = 0.0;
const double t_fin = 1000.0;
const double delta = 0.025;


Tuples initial_states;
initial_states.reserve(1000);
for (int i=0; i<1000; ++i)
initial_states.push_back( std::make_pair((double)i, 0.0) );

Tuples states(initial_states);

System::DateTime start0 = System::DateTime::Now;
Force<double> accel;
EulerImplicit< Force<double> > euler(accel);
for (Tuples::iterator it = states.begin();
it != states.end();
++it)
{
simulate(euler, it->first, it->second, t0, t_fin, 0.25 * delta);
}
System::TimeSpan diff0 = System::DateTime::Now - start0;
std::cout << "Time: " << diff0.TotalSeconds;
for (int i = 0; i < 20; ++i)
{
std::cout << " state = (" << states.at(i).first
<< "; " << states.at(i).second << ")" << std::endl;
}

states = initial_states;
start0 = System::DateTime::Now;
Adapt< Force<double> > adapted(accel);
RK4< Adapt< Force<double> > > rk4(adapted);
for (Tuples::iterator it = states.begin();
it != states.end();
++it)
{
simulate(rk4, it->first, it->second, t0, t_fin, delta);
}
diff0 = System::DateTime::Now - start0;
std::cout << "Time: " << diff0.TotalSeconds;
for (int i = 0; i < 20; ++i)
{
std::cout << " state = (" << states.at(i).first
<< "; " << states.at(i).second << ")" << std::endl;
}

Console::ReadKey(true);
return 0;
}

Monday, September 15, 2008

Tutorial 7: Concurrency

It seems that every tutorial on concurrency has to cite The Free Lunch is Over: A Fundamental Turn Toward Concurrency in Software by Herb Sutter, and I won't make an exception. Basically, we are not going to get faster CPUs in the future. What we are going to get is more CPUs. The challenge for programmers is then to write software which can take advantage of the extra CPUs.

The code of this Tutorial shows how to achieve that in F#, and how easy it is.

#light

open System

let run_time_func (func: unit -> 'a) =
let time0 = System.DateTime.Now
let res = func ()
let diff0 = System.DateTime.Now - time0
(diff0, res)

let map_parallel func items =
let tasks =
seq {
for i in items -> async {
return (func i)
}
}
Async.Run (Async.Parallel tasks)

let gravity =
-9.81

let inline spring pos =
let k = 100.0 in
-k * pos

let inline drag k pos speed = -k * speed

let inline euler_implicit accel_func pos speed delta =
let speed2 = speed + delta * accel_func pos speed in
pos + delta * speed2, speed2

let inline adapt (f: 'num -> 'num -> 'num) =
fun pos speed ->
let accel = f pos speed in
speed, accel

let inline runge_kutta_4 deriv_func pos speed delta =
let half_delta = 0.5 * delta
let s1, a1 = deriv_func pos speed
let s2, a2 = deriv_func (pos + half_delta * s1) (speed + half_delta * a1)
let s3, a3 = deriv_func (pos + half_delta * s2) (speed + half_delta * a2)
let s4, a4 = deriv_func (pos + delta * s3) (speed + delta * a3)
let pos1 = pos + delta/6.0*(s1 + 2.0*s2 + 2.0*s3 + s4)
let speed1 = speed + delta/6.0*(a1 + 2.0*a2 + 2.0*a3 + a4) in
pos1, speed1

let inline simulate intg_func t0 t_fin delta pos0 speed0 =
let rec repeat t0 pos0 speed0 =
if t0 > t_fin then (pos0, speed0)
else
let t1 = t0 + delta
let pos1, speed1 = intg_func pos0 speed0 delta
repeat t1 pos1 speed1
repeat t0 pos0 speed0


let initial_states = [ for s in 1..1000 -> (float s * 1.0, 0.0) ]
let t0 = 0.0
let t_fin = 1000.0
let delta = 0.025

let accel pos speed = (drag 1.0) pos speed + spring pos + gravity in
for i in 0..3 do
let (run_time, res) = run_time_func (fun () -> initial_states |>
List.map(fun (x,y) -> simulate (euler_implicit accel) t0 t_fin (0.25*delta) x y))
printfn "Euler single: %f" run_time.TotalSeconds;

let (run_time, res) = run_time_func (fun () -> initial_states |>
map_parallel(fun (x,y) -> simulate (euler_implicit accel) t0 t_fin (0.25*delta) x y))
printfn "Euler multi: %f" run_time.TotalSeconds;

let (run_time, res) = run_time_func (fun () -> initial_states |>
List.map(fun (x,y) -> simulate (runge_kutta_4 (adapt accel)) t0 t_fin delta x y))
printfn "RK4 single: %f" run_time.TotalSeconds;

let (run_time, res) = run_time_func (fun () -> initial_states |>
map_parallel(fun (x,y) -> simulate (runge_kutta_4 (adapt accel)) t0 t_fin delta x y))
printfn "RK4 multi: %f" run_time.TotalSeconds;
done;

ignore(Console.ReadKey(true))


There are two changes in this code compared to Tutorial 6. One of them makes it easier to measure execution times. The other is the introduction of asynchronous computations. I will start with the description of the latter.

let map_parallel func items =
let tasks =
seq {
for i in items -> async {
return (func i)
}
}
Async.Run (Async.Parallel tasks)

This defines a function "map_parallel" taking a function and a list of items as parameters. It is similar to "Array.map", which was used in Tutorial 6: it applies the function "func" on each item of the list, and produces a list of the results. What is new here is it does so not in a sequential manner, but in an asynchronous way.

The "map_parallel" function starts by defining a sequence of tasks. The syntax
seq { for i in items -> Expr }
produces a sequence of items whose value is the result of evaluating Expr for each item.

async { return (func i) }
This defines an asynchronous computation, i.e. a computation that can be executed asynchronously. At this point it may seem that actually executing all these computations in parallel and wait for their completion is a complicated thing to do. Happily, there are functions in the standard library of F# that handle all the tricky business of creating threads, starting them, distributing computations on these threads and waiting for their completion.

Async.Run (Async.Parallel tasks)
I don't fully understand what these two functions exactly do, I'm happy enough that they do what I want when composed.

All there is to do is replace the call to Array.map by parallel_map:
map_parallel(fun (x,y) -> simulate (euler_implicit accel) t0 t_fin (0.25*delta) x y)


That's it, the performance numbers have now increased, running on a dual-core (although by less than a factor of two). Below is the updated performance table.


ImplementationPerformance EulerPerformance RK4
C++172413326530
F# Tutorial 6 66225111731
F# Tutorial 7 129032185185



The rest is just utility code intended to make it easier to measure execution times.

let run_time_func (func: unit -> 'a) =
let time0 = System.DateTime.Now
let res = func ()
let diff0 = System.DateTime.Now - time0
(diff0, res)
This defines a function "run_time_func" which takes a function of zero argument and returns a value of some type "a". Notice the use of "unit" to denote "no argument", which is similar to the "void" type of C.
"func ()" invokes this function. "func" alone would have denoted the function itself, instead of calling it.

In order to prevent calling the "simulate" function in the main code, I have added "fun () -> ", thus wrapping the call to "simulate" inside a function taking no argument.
run_time_func (
fun () ->
initial_states |>
List.map(fun (x,y) -> simulate (euler_implicit accel) t0 t_fin (0.25*delta) x y)
)

It is this wrapper that is passed to "run_time_func", and its execution results in the call to "simulate".

The "run_time_func" function returns a tuple composed of the execution time and the result of the function call.

Sunday, September 14, 2008

Tutorial 6: Inlining

As usual, code first:
#light

open System

let gravity =
    -9.81

let inline spring pos =
    let k = 100.0 in
        -k * pos

let inline drag k pos speed = -k * speed

let inline euler_implicit accel_func pos speed delta =
    let speed2 = speed + delta * accel_func pos speed in
        pos + delta * speed2, speed2

let inline adapt (f: 'num -> 'num -> 'num) =
    fun pos speed ->
        let accel = f pos speed in
            speed, accel

let inline runge_kutta_4 deriv_func pos speed delta =
    let half_delta = 0.5 * delta
    let s1, a1 = deriv_func pos speed
    let s2, a2 = deriv_func (pos + half_delta * s1) (speed + half_delta * a1)
    let s3, a3 = deriv_func (pos + half_delta * s2) (speed + half_delta * a2)
    let s4, a4 = deriv_func (pos + delta * s3) (speed + delta * a3)
    let pos1 = pos + delta/6.0 * (s1 + 2.0 * s2 + 2.0 * s3 + s4)
    let speed1 = speed + delta/6.0 * (a1 + 2.0 * a2 + 2.0 * a3 + a4) in
        pos1, speed1

let inline simulate intg_func t0 t_fin delta pos0 speed0 =
    let rec repeat t0 pos0 speed0 =
        if t0 > t_fin then (pos0, speed0)
        else
            let t1 = t0 + delta
            let pos1, speed1 = intg_func pos0 speed0 delta
            repeat t1 pos1 speed1
    repeat t0 pos0 speed0

let initial_states = [| for s in [1..1000] -> (float s, 0.0) |] 
let t0 = 0.0
let t_fin = 1000.0
let delta = 0.025

let accel pos speed = (drag 1.0) pos speed + spring pos + gravity in

let time0 = System.DateTime.Now
let final_states = initial_states |>
Array.map (fun (x,y) -> simulate (euler_implicit accel) t0 t_fin (0.25*delta) x y)
let diff0 = System.DateTime.Now - time0
printfn "Time: %f Result: %A" diff0.TotalSeconds final_states;

let time0 = System.DateTime.Now
let final_states = initial_states |>
Array.map (fun (x,y) -> simulate (runge_kutta_4 (adapt accel)) t0 t_fin delta x y)
let diff0 = System.DateTime.Now - time0
printfn "Time: %f Result: %A" diff0.TotalSeconds final_states

ignore(Console.ReadKey(true))


This code differs from the code in Tutorial 5 in the following ways:
- Some functions were declared "inline".
- The adapted function required by the RK4 integrator is constructed at the top level.
- The simulation is run on 1000 bodies instead of one.

The "inline" keyword in C++ instructs the compiler to emit the code of the function in place where it is called, instead of using a jump. I guess the effect is the same in F#. Which functions should be inlined is hard to tell. Small functions are obvious candidates. It also appears that F# automatically inlines some functions, even though they are not marked with "inline". For instance, "accel" was inlined inside the code for "runge_kutta_4". It was however not inlined inside "euler_implicit", which may explain why RK4 ended up faster than Euler (see the table further down).

The adapted function is now built once at the top level, instead of being repeatedly recreated inside "runge_kutta_4". It now seems obvious that this is the right way to do it, but if you look back to Tutorial 4 you may understand how I came to doing it the stupid way. Writing code in an incremental fashion, adding small bits of functionality after another often leads to suboptimal code, but that is hardly avoidable. This underlines how important it is for programming languages to support code refactoring in an easy and safe manner.

I modified the code to run a simulation of 1000 bodies instead of just one. My goal was to introduce concurrency, but I'll leave that for later. Here comes the explanations for the new code:
let initial_states = [| for s in [1..1000] -> (float s, 0.0) |]

If you are familiar with Python you will probably have recognized a comprehension. Comprehensions are also supported by F#. This code declares a variable "initial_states" which is an array of pairs of floats, where the first member denotes the initial position, and the second the initial speed. The new bodies are initially immobile and located one on top of another, separated by 1 meter.
Notice how the integer "s" must be explicitly converted to a float. In F#, integers cannot be multiplied by floats.

let final_states = initial_states |>
    Array.map (fun (x,y) -> simulate (euler_implicit accel) t0 t_fin (0.25*delta) x y)
The array of initial states is passed to the Array.map function, which takes a function and applies it to each item in the array, effectively producing a new array.
These lines result in the application of the simulate function to each initial state.

The "|>" operator is just syntactical sugar. This code could have been written:
let final_states =
    Array.map (fun (x,y) -> simulate (euler_implicit accel) t0 t_fin (0.25*delta) x y)
    initial_states


At the end of Tutorial 5, we had managed to write code that was generic, concise and hopefully correct. Sadly, it was also rather slow. How much faster is it now? See the table below.


ImplementationPerformance EulerPerformance RK4
C++172413326530
F# Tutorial 53015632425
F# Tutorial 6 66225111731
OCaml6578971428



Numbers in the performance columns are (numbers of seconds simulated) x (number of bodies) / (running time). They are an indication of how many computations per second each implementation achieved.
Note that the simulation based on Euler uses a time step four times smaller than that of RK4. I think that is fair, since Euler evaluates the acceleration function once each step, instead of four times for RK4.

The C++ implementation is generic (using templates, no virtual method was used) and uses side-effects. I also had a non-generic version which was about twice as fast, but I deleted it (stupid me).
The OCaml implementation is almost identical to the code in Tutorial 5, which means it is both generic and pure (no side effects).
The good performance of RK4 in F# is somewhat artificial and probably untypical of a real application. Indeed, the acceleration function used in this example is excessively simple.

The C++ code is 186 lines long (including empty lines), the F# code is 60 lines long. I also find the F# code much more readable. Compare with the code of RK4 in C++
template<typename A>
class RK4
{
public:
    RK4(const A& accel):
       accel(accel)
    {
    }

    template<typename T>
    inline
    void operator()(T& pos, T& speed, const double delta) const
    {
        const double half_delta = 0.5 * delta;

        std::pair k1 = accel(pos, speed);
        std::pair k2 = accel(pos + half_delta * k1.first, speed + half_delta * k1.second);
        std::pair k3 = accel(pos + half_delta * k2.first, speed + half_delta * k2.second);
        std::pair k4 = accel(pos + delta * k3.first, speed + delta * k3.second);

        pos += delta / 6.0 * (k1.first + 2.0 * (k2.first + k3.first) + k4.first);
        speed += delta / 6.0 * (k1.second + 2.0 * (k2.second + k3.second) +     k4.second);
    }

private:

    const A& accel;
};


What should we conclude? I personally think that the 3x performance penalty of F# is acceptable, considering how uncompromising my F# code is regarding purity and genericity. It would be interesting to see how fast the F# code rewritten to use side effects would run.
This little experiment was enough to convince me to move over to F#. I stuck to C++ because of its support for generic programming (using templates), static type checking and for the large amount of code available. F# supports generic programming using a much cleaner syntax, its IDE performs type checks while the code is being typed in, and it has access to the .NET framework.
A last interesting fact to report: I wrote the generic C++ code after I wrote the F# code, without thinking very much. It was easy to translate the F# code to C++. It would have been considerably harder to write the C++ code from scratch. As a matter of fact, I had no clear idea how to implement RK4 in a generic way before I tackled the problem "thinking" in F#. UPDATE (may 2012): The numbers for C++ were actually for managed C++. After further optimizations (see tutorial 7b), F# went ahead of the managed C++ version. I later wrote a native C++ version which if I remember correctly was about 10 times faster.