Log in to like this post! Infragistics Parsing Framework - Removing Ambiguities, Part 1 Mike Dour / Monday, December 3, 2012 This post is part of a series on the parsing framework provided by Infragistics starting in version 12.2. The previous post in the series can be found here and the next post can be found here. As I mentioned in my first post, it is important to keep the number of ambiguities in a grammar to a minimum to ensure documents are parsed as quickly as possible. I’ll discuss a few common mistakes which can slow down the parser and provide ways to re-write the grammar rules so they are unambiguous but still describe the same language or a similar language. What Causes Ambiguities? Recall that at the end of the previous post I mentioned all ambiguities, also known as conflicts, in an LR(k) parser occur when one or both of the following occurs: The next k tokens from the lexer tell the parser that two or more different production bodies have been read completely and can be reduced to their head symbols. This is known as a reduce/reduce conflict. The next k tokens from the lexer tell the parser that a production body has been read completely and can be reduced to its head symbol, but the first of those tokens also moves the parser along in a different production body which is not yet complete. This is known as a shift/reduce conflict. What may not seem so obvious about this is that any number of productions can be “in construction” while reading the same tokens from the lexer. The parser only needs to decide what production it created after reading all tokens in the body plus the next k tokens. When a conflict occurs for a token, the parser will split itself into two or more parsers and each one will continue processing tokens until all parsers but one fail to parse due to errors (a local ambiguity), or multiple sub-trees are completed to represent the same set of tokens from the lexer (a global ambiguity). Before I begin discussing ambiguities, it is important to keep in mind that the parser provided by Infragistics uses one token of look-ahead, so it is an LR(1) parser which can handle ambiguities. Because of this, I will only discuss ambiguities which occur with one token of look-ahead. I will not try to generalize the post to discuss ambiguities with k tokens of look-ahead. Reduce/Reduce Conflicts First, I’ll discuss reduce/reduce conflicts. The general set of conditions which must be met for a reduce/reduce conflict to occur are as follows: Two or more non-terminal symbols have production bodies which can represent the same sequence of tokens. The non-terminal symbols can be used in the same context. A terminal symbol which can follow them is the same. When this happens, we have a reduce/reduce conflict. For example, consider the grammar with the following productions and “S” defined as the start symbol (uppercase letters are non-terminal symbols and lowercase letters are terminal symbols). S → v A x y S → v B x z A → w B → w B → u In this example, the non-terminal symbols with production bodies that can represent the same sequence of tokens are “A” and “B”. They can be used in the same context: specifically when the parser has received and processed a <v> as the first token from the lexer. And the terminal symbol which can follow both non-terminal symbols is “x”. Now let’s say the lexer provides the following tokens to the parser: <v><w><x><y>. When the parser gets to the <x> token, it is a look-ahead token which indicates a reduction is possible from “w” to “A” and another reduction is possible from “w” to “B”. This will split the parser so both options can be tried and both parsers will need to process subsequent tokens from the lexer. Upon reaching the <y> token, the first parser will read it and continue, but the second parser will fail because the “B” non-terminal cannot be followed by “x y” and it will go away, allowing just a single parser to continue. However, even though this will eventually be resolved by reading enough tokens from the lexer, it does slow down the performance of the parser when encountering these situations, so it should be avoided when possible. The basic problem here, as with all shift/reduce and reduce/reduce conflicts is that there are too many reductions possible and therefore too many parent nodes to create for the tree. We need to find some way to decrease the number of reductions (I almost said “reduce the number of reductions” but I thought that might be confusing). I see two solutions to this problem. The first solution is what I’ll call “merging” the non-terminals: the non-terminals which can represent the same symbols can be combined into one non-terminal if possible. In our original grammar above, “A” and “B” could be merged into a single non-terminal “AB” which has all unique productions from “A” and “B”. Here is how the productions would look: S → v AB x y S → v AB x z AB → w AB → u Now when the lexer provides tokens <v><w><x><y>, the <x> token is a look-ahead which indicates that a reduction to only the “AB” non-terminal symbol can be performed and there is no ambiguity. We decreased the number of possible reductions to one. The downside to this approach is that merging the non-terminals could mean the accepted language is less restrictive than the original. In this example, the series of tokens <v><u><x><y> will be parsed without errors whereas it would have been an error when using the original grammar. However, detecting this error could be done in code with semantic analysis on the parse tree after it is created. This is a tradeoff the grammar writer must weigh: should the grammar allow for a less restrictive, faster parse without ambiguities or should the grammar parse correctly, slower, and require less work to analyze semantics of the text after the parse is complete. The second solution is what I’ll call “promoting” the non-terminals’ children: the symbols represented by the non-terminals can be moved into the place where the non-terminals were used. So in this case, the grammar could be changed so that it only contains the following productions: S → v w x y S → v w x z S → v u x z (Note that when promoting the “B” non-terminal symbol the production which uses it must be duplicated so that each variation of “B” can be represented) Now when the parser reaches the <x> token, it doesn’t have to create multiple non-terminals. In fact, it doesn’t have to create any non-terminals. We decreased the number of possible reductions to zero. It can continue constructing both productions without a conflict. However, there are some downsides to this approach. First of all, there will no longer be nodes to represent “A” or “B” in the final parse tree. This changes the grammatical structure of a parsed document and may require some additional work on the part of the developer to analyze the tree as if the “correct” grammar were being used. The second downside is that it could increase the number of productions in the grammar by a huge factor. As we saw, when a production uses a promoted non-terminal symbol, that production must be duplicated n times, where n is the number of productions which have the promoted non-terminal symbol as the head. If a production uses multiple non-terminal symbols which were replaced, the number of copies required to promote the non-terminals could be huge. Consider a production which used three instances of “B” in its body. If we replace the first “B”, the production needs to be duplicated twice for each variation of “B”. If we replace the second “B” as well, each production needs to be duplicated again and we end up with 4 productions. After replacing the last “B”, we end up with 8 productions. And just imagine if “B” were the head symbol for 20 different productions: Promoting “B” into this production would result in 8,000 unique productions. Another thing to note is that this approach will not work if the non-terminal is used as part of a repetition. So if you were using a slightly modified grammar from the one above and your EBNF file has the following rules: S = (v, {A}, x, y) | (v, {B}, x, z);A = w;B = w | u; You could not solve this problem by using this rule instead: S = (v, {w}, x, y) | (v, {w | u}, x, z); The reason for this is that we internally generate a hidden non-terminal to take the place of repetitions, so what these new rules actually produce for the parser is these productions: S → v S1 x y S → v S2 x z S1 → S1 → w S1 S2 → S2 → w S2 S2 → u S2 So upon reaching the <x> token, the parser still has a reduce/reduce conflict between “S1” and “S2” and the problem hasn’t been solved. Now there is something important to note in how I worded the 3rd condition for reduce/reduce conflicts. I said “A terminal symbol which can follow them [the non-terminals representing the same tokens] is the same.” I did not say “A terminal symbol which can follow them in the ambiguous context is the same.” There is an important distinction here which an example will help illustrate. Let’s say we have a grammar with the following productions (again, “S” is the start symbol): S → A, y S → B, z S → w B y A → x B → x If I had said this type of conflict occurs when “A terminal symbol which can follow them in the ambiguous context is the same,” then the lexer providing the tokens <x><y> would not cause a conflict. After reading the <x> token, the parser is in a state where it could create an “A” for productoin 1 or a “B” for production 2. Then the <y> would be a look-ahead token for “A” because production 1 allows for “A” followed by “y”. But it would not be a look-ahead token for “B”, because production 2 does not allow “B” followed by “y”, so no conflict would occur. The fact that production 3 allows for “B” followed by “y” is irrelevant because production 3 was not being constructed by the parser at the time: it did not read a <w> from the lexer at the start of the parse. However, this is not the case for the Infragistics parser. As I said, the condition is “A terminal symbol which can follow them is the same.” A “y” can follow a “B” in production 3, which means when the parser is in a state where a “B” can occur next and it reads an <x> token followed by a <y> token, it will always try to reduce a “B”, regardless of the context. [As an aside, the parser uses parse tables to make parsing decisions. These tables indicate when to reduce productions and when to continue forming them based on the current state of the parser and the current token from the lexer. There are different ways to create these tables and they produce tables with varying restrictions and sizes. One type of table construction produces what are known as canonical-LR(k) parse tables. These tables only have reduce/reduce conflicts when the terminal symbol actually causes a conflict in the ambiguous context, but the table sizes could be way too large for a production parser of a typical programming language. Another type of table construction produces what are known as LALR(k) parse tables, which are much smaller than canonical-LR(k) parse tables for the same grammar and k value, but they may produce unnecessary reduce/reduce conflicts, as described here. The Infragistics parsing framework creates LALR(1) parse tables based on the grammars provided by you because the amount of memory saved by the smaller tables for a typical grammar is a bigger benefit than the potential performance hit incurred by the extra reduce/reduce conflicts.] Next time I’ll discuss shift/reduce conflicts. By Mike Dour The information contained herein is provided for reference purposes only. Copyright © 2012 Infragistics, Inc., All Rights Reserved.