Friday, May 18, 2012

Recursive descent parser using active patterns

Today's post isn't specifically about games, but about parsing, which I find is a recurring task in many programming tasks, including game-related tasks.

In F#, the most popular methods for writing parsers are FParsec and fslex/fsyacc. Although parser generators are very useful, I'm always a bit reluctant to depend on third-party software.

In the past I have worked on a development tool for safety-critical systems, which was itself subject to some of the limitations of software for safety-critical systems. In particular, the use of lex and yacc was not accepted.

I fell back on a technique suitable for hand-written parsers, namely recursive descent parsers. I was positively surprised by the inherent simplicity of the method and the clarity of code that resulted.

I was first exposed to the idea of using active patterns for parsing when I read a blog post by Jon Harrop.
Active patterns is a feature specific to F#, and the F# wikibook has a chapter dedicated to them. I'll be using parameterized partial active patterns.

We start by defining a type for the input. The integer denotes the number of lines read so far, the list of strings is the input split at line-breaks.
open System.Text.RegularExpressions

type Stream = Stream of int * string list

A pattern to detect the end of the input, when the list of strings is empty.
let (|EOF|_|) = function
    | Stream(_, []) -> Some()
    | _ -> None

A pattern to detect end-of-lines, by recognizing list of strings which start with the empty string.
let (|EOL|_|) = function
    | Stream(n, "" :: rest) ->
        Some (Stream(n + 1, rest))
    | _ -> None
This pattern eats white space.
let (|OneWS|_|) = function
    | Stream(n, line :: rest) ->
        let trimmed = line.TrimStart()
        if trimmed <> line then
            Some (Stream(n, trimmed :: rest))
        else
            None
    | _ -> None
A convenient pattern to eat sequences of white space and newlines. Note the use of the rec keyword, which allows to refer to the pattern itself in its implementation.
let rec (|WS|_|) = function
    | OneWS (WS s)
    | EOL (WS s) -> Some s
    | s -> Some s
We could also have attempted another variant, where WS uses itself for the first part, which would implement a left-recursive grammar. Unfortunately, this pattern would be useless, as all it would do is raise stack-overflow exceptions.
let rec (|WS1|_|) = function
    | WS1 (OneWS s) -> Some s
    | s -> Some s
My variant of the Regex pattern differs a bit from the one in the wikibook, to avoid re-building Regex objects, which has a non-neglectable cost.
let (|Regex|_|) (regex : Regex) = function
    | Stream(n, line :: rest) ->
        let m = regex.Match(line)
        if m.Success then
            let lineRest = line.Substring(m.Length)
            let values =
                [ for gr in m.Groups -> gr.Value ]
            Some (values, Stream(n, lineRest :: rest))
        else None
    | _ -> None
A sample pattern to illustrate the use of the Regex pattern.
let personRegex = Regex("NAME: (\w+) AGE: (\d+)")
let (|Person|_|) = function
    | WS (Regex personRegex ([_ ; name ; age], s)) -> Some ((name, age), s)
    | _ -> None
A pattern to parse data of a large number of persons. I could have used recursion at the level of the pattern itself, similarly to WS. However, such a pattern would not be tail-recursive, and would cause stack overflows for large numbers of entries.
let (|Persons|_|) s =
    let rec work s persons =
        match s with
        | Person (p, s) -> work s (p :: persons)
        | _ -> (persons |> List.rev, s)
    let persons, s = work s []
    Some (persons, s)
Finally, some utility code to initialize streams, generate one million lines to parse, followed by the parsing itself.
let streamOfLines lines =
    Stream(0, lines |> List.ofSeq)

let streamOfFile path =
    Stream(0, System.IO.File.ReadAllLines(path) |> List.ofArray)

let stream =
    seq { for n in 0..1000000 do
            yield sprintf "  NAME: Anonymous AGE: %d              " (n % 99) }
    |> streamOfLines

#time "on"
match stream with
| Persons (p, WS EOF) -> printfn "Success %A" p
| _ -> printfn "Failed"
#time "off"
Parsing takes about 9 seconds on my PC (18s if I recreate the Regex object repeatedly). I think that's very decent performance.
In a future post, I'll describe how to parse Boolean formulas and expand a bit on the power and performance of this method.

2 comments:

Dave Thomas said...

Would you consider this approach viable for parsing say language files, like clojure source etc?

Johann Deneux said...

This would not be my method of choice for anything but very simple grammars. The problem is that bugs are hard to find. For instance, you can't tell if the grammar is ambiguous, whereas a parser generator can't tell you about that (shift/reduce conflicts...)

Diagnosing why the parser rejects a string you think should be accepted can be a bit difficult too (but then, it's always tricky regardless of the method anyway).

Antlr is what we use at work. We also use active patterns to consume the tree produced by Antlr and produce a typed abstract syntax tree.