Compiler Series Part II: Compiler Basics
This is the second part in a series of articles on implementing a custom .NET compiler for a card game language called CGL. The project page for CGL, and a list of all the articles in the series can be found here.
In this article I'll introduce the concepts used as building blocks for the compiler and how they can be composed to match complex textual patterns. At the end of this article we'll be able to parse ordered bits of text and turn them into managed objects.
There are a number of ways to approach creating a compiler, but the one we'll be exploring here is a simple recursive decent parser where the grammar for the language is encoded in C# code rather than being described using formats such as EBNF. This is not really a theoretical series so I won't be going into detail on the type of compiler or discuss topics such as formal semantics, just the practical bits.
The main concept I'm building my compiler around is a delegate for parsing source code which returns a parse result of an arbitrary C# type. A number of these can then be composed together to match complex patterns and create a whole object graph. This concept is not something I've invented myself, it's based on a blog post by Louis DeJardin, the guy behind the Spark View Engine for ASP.NET MVC. This post is very similar to Louis', which can be found here, but in the next article and forward we use it for our own specialized grammar.
To get started we'll define the interfaces of the basic concepts we're going to need:
- public delegate ParseResult<TValue> ParseAction<TValue>(Position position);
- public class ParseResult<TValue>
- {
- public Position Rest { get; }
- public TValue Value { get; }
- public ParseError Error { get; }
- }
- public class Position
- {
- public int Line { get; }
- public int Column { get; }
- public string Context { get; }
- public char Peek();
- public string Peek(int length);
- public Position Advance(int offset);
- }
- public class ParseError
- {
- public int Line { get; }
- public int Column { get; }
- public string Message { get; }
- }
The ParseAction delegate is a function which, given a position in the source code, yields either a result of the generic type or a parse error if the pattern did not match. The result of the parser is stored in a ParseResult object which hold either of the possible outcomes. It also contains the position in the source code where a subsequent parser would start off from. The Position object is an immutable description of a point in the source code and it can potentially wrap a source that is more complex than a string if needed.
Using these classes we can begin to create a Grammar class where we define the most basic patterns we want to match.
- public class Grammar
- {
- public static ParseAction<char> Ch(char match)
- {
- return delegate(Position input)
- {
- if (input.Peek() == match)
- return new ParseResult<char>(input.Advance(1), match);
- return new ParseResult<char>("Encountered unexpected char", input.Line, input.Column);
- };
- }
- public static ParseAction<IList<TValue>> Rep<TValue>(ParseAction<TValue> parser)
- {
- return delegate(Position input)
- {
- Position rest = input;
- List<TValue> list = new List<TValue>();
- var results = new List<ParseResult<TValue>>();
- results.Add(parser(input));
- while (!results.Last().HasError)
- {
- list.Add(results.Last().Value);
- rest = results.Last().Rest;
- results.Add(parser(rest));
- }
- return new ParseResult<IList<TValue>>(rest, list);
- };
- }
- public static ParseAction<TValue> Or<TValue>(ParseAction<TValue> parser1, ParseAction<TValue> parser2)
- {
- return delegate(Position input)
- {
- var result1 = parser1(input);
- if (result1.HasError) return parser2(input);
- return result1;
- };
- }
- public static ParseAction<Chain<TValue1, TValue2>> And<TValue1, TValue2>(ParseAction<TValue1> parser1, ParseAction<TValue2> parser2)
- {
- return delegate(Position input)
- {
- var result1 = parser1(input);
- if (result1.HasError)
- return new ParseResult<Chain<TValue1, TValue2>>(result1.Error);
- var result2 = parser2(result1.Rest);
- if (result2.HasError)
- return new ParseResult<Chain<TValue1, TValue2>>(result2.Error);
- var value = new Chain<TValue1, TValue2>(result1.Value, result2.Value);
- return new ParseResult<Chain<TValue1, TValue2>>(result2.Rest, value);
- };
- }
- }
Each of the above methods generate a ParseAction for a specific, useful purpose. The first matches a specific char while Rep matches a repeated sequence of a pattern. Using these, we can write the following code (I've removed the references to the Grammar class from the method calls to simplify the example):
- var parseChar = Ch('c');
- var parseChars = Rep(parseChar);
- var result1 = parseChar(new Position("c"));
- var result2 = parseChars(new Position("cccc"));
As you can see, we used one parser to create another. The Rep function keeps using the Ch('c') parser to match characters, which it stuffs into a List<char>. The methods Or and And does the same for matching one pattern or another, or one pattern followed by another, respectively. As a side note, you can see that I have supplied none of the type arguments for the generic function. This is because of the powerful type inference capability of the C# compiler.
- var parseAnotherChar = Ch('k');
- var parseEitherChar = Rep(Or(parseChar, parseAnotherChar));
- var parseBothChars = Rep(And(parseChar, parseAnotherChar));
- var result1 = parseEitherChar(new Position("kkkckckc"));
- var result2 = parseBothChars(new Position("ckckckckck"));
Here, we define two parsers that matches series of characters. The one using Or matches a sequence where each char can either be c or k where the one using And matches only sequences of cs followed by a k. To end up with more fluent code, we can define a few useful extension methods:
- public static class GrammarExtensions
- {
- public static ParseAction<TValue> Or<TValue>(this ParseAction<TValue> parser1, ParseAction<TValue> parser2)
- {
- return Grammar.Or(parser1, parser2);
- }
- public static ParseAction<Chain<TValue1, TValue2>> And<TValue1, TValue2>(this ParseAction<TValue1> parser1, ParseAction<TValue2> parser2)
- {
- return Grammar.And(parser1, parser2);
- }
- }
- // Allows for the following
- var parseEitherChar = Rep(parseChar.Or(parseAnotherChar));
- var parseBothChars = Rep(parseChar.And(parseAnotherChar));
As you might have noticed, the And parser returns an object of the type Chain. This is simply a container for the results of the two parse actions. However, it introduces the problem of creating increasingly complex result types as your parsers become more complex. To reduce this problem we introduce the Build function which translates a complex result into a simpler one. It basically just takes the result of a parser and applies a function to it.
- public static ParseAction<TValue2> Build<TValue1, TValue2>(ParseAction<TValue1> parser, Func<TValue1, TValue2> builder)
- {
- return delegate(Position input)
- {
- var result = parser(input);
- if (result.HasError) return new ParseResult<TValue2>(result.Error);
- try
- {
- return new ParseResult<TValue2>(result.Rest, builder(result.Value));
- }
- catch (ParseException e)
- {
- return new ParseResult<TValue2>(e.Message, input.Line, input.Column);
- }
- };
- }
- var parseBothChars = And(parseChar, parseAnotherChar).Build(hit => hit.Left + hit.Down);
In this case, we use Build to change the result type from Chain<char,char> to string. Once again, the type inferral of the C# compiler comes to our rescue, providing us with proper intellisense on the properties on the Chain object. This will save your sanity when building from more complex chains. Also notice that we catch ParseException errors, a custom exception type which we can use to signal that some parse rule was broken when trying to build the object.
These, and more, are common building blocks for any grammar and, therefore, our primary job is now to create a Grammar class which holds all the basic parsers. To create a more specialized grammar you simply inherit from Grammar and add new parsers.
We're getting to the end of this article but hopefully you got the idea of how this all works. If not, I urge you to go read up on Louis' blog on the same topic. I've also uploaded some of the code from my compiler so you can see some of the other parsers to put in the Grammar class. The code won't compile as it relies on some other code of mine, but it should be possible to understand it regardless.
In the next article we'll begin to create a more specialized grammar for the CGL language using the building blocks presented here.
Comments