Wait the light to fall

Chapter 17. Grammars

焉知非鱼

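Notice

This translation of the chapter is intended only for studying and researching Raku; please support the e-book or print edition.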

Grammars are patterns on a higher plane of existence. They integrate and reuse pattern fragments to parse and react to complicated formats. This feature is at the core of Raku in a very literal sense; the language itself is implemented as a grammar. Once you start using it you’ll probably prefer it to regexes for all but the most simple problems.

A Simple Grammar

A grammar is a special sort of package. It can have methods and subroutines but mostly comprises special pattern methods called regex, token, and rule. Each of these defines a pattern and applies different modifiers.

NOTE

Raku tends to refer to regex, token, and rule declarations as “rules,” which can be a bit imprecise at times. In this book, you can tell the difference between the language keyword and the general term by the typesetting. I’ll try to not present an ambiguous situation.

Start with something simple (too simple for grammars). Define a TOP pattern that matches digits as the starting point. That name is special because .parse uses it by default. In this example, you declare that with regex:

grammar Number {
    regex TOP { \d }
    }

my $result = Number.parse( '7' );  # works

put $result ?? 'Parsed!' !! 'Failed!';  # Parsed!

This succeeds. .parse applies the grammar to the entire value of 7. It starts with the parts that TOP describes. It can match a digit, and the value you pass to .parse is a digit.

When .parse succeeds, it returns a Match object (it returns Nil when it fails). Try it with a different value. Instead of a single digit, try several digits:

my $result = Number.parse( '137' );  # fails (extra digits)

put $result ?? 'Parsed!' !! 'Failed!';  # Failed!

This time .parse doesn’t succeed. It starts matching with the first character and ends matching on the last character. It asserts that the text starts, there is a single digit, and the text ends. If .parse sees that there are some characters before or after its match, it fails. It matches everything or not at all. It’s almost the same thing as explicitly using anchors:

grammar Number {
    regex TOP { ^ \d+ $ }  # explicitly anchored
    }
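To see the implicit anchoring in action, here is a minimal sketch that runs the same two inputs through .parse and through an ordinary pattern with explicit anchors. The names SingleDigit and $text are illustrative and not part of the original example; the two results should agree for each input:

```raku
# SingleDigit and $text are hypothetical names used only for this sketch
grammar SingleDigit {
    regex TOP { \d }
    }

for '7', '137' -> $text {
    my $parsed   = SingleDigit.parse( $text );  # implicitly anchored at both ends
    my $anchored = $text ~~ / ^ \d $ /;         # the same constraint spelled out
    put "$text parse: { $parsed.so }, anchored pattern: { $anchored.so }";
    }
```

The grammar version earns its keep once you start adding more named rules, which a bare pattern can't do as neatly.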

But TOP is only the default starting point for a grammar. You can tell .parse where you’d like to start. This version defines the same pattern but calls it digits instead of TOP:

grammar Number {
    regex digits { \d+ }
    }

Tell .parse where to start with the :rule named argument:

my @strings = '137', '137 ', ' 137 ';

for @strings -> $string {
    my $result = Number.parse( $string, :rule<digits> );
    put "「$string」 ", $result ?? 'Parsed!' !! 'Failed!';
    }

The first element of @strings parses because it is only digits. The other ones fail because they have extra characters:

「137」 Parsed!
「137 」 Failed!
「 137 」 Failed!

Declare digits with rule instead of regex. This implicitly allows whitespace after any part of your pattern:

grammar Number {
    rule digits { \d+ }  #  not anchored, and works
    }

Now the second Str matches too because the implicit whitespace can match the space at the end (but not the beginning):

「137」 Parsed!
「137 」 Parsed!
「 137 」 Failed!

The rule applies :sigspace to its pattern. It’s the same thing as adding that adverb to the pattern:

grammar Number {
    regex digits { :sigspace \d+ }
    }

:sigspace inserts the predefined <.ws> after pattern tokens. Since there’s a dot before the name ws, the <.ws> does not create a capture. It’s the same as adding optional whitespace explicitly:

grammar Number {
    regex digits { \d+ <.ws> }
    }

Instead of showing Parsed!, you can output the Match object stored in $result when the parse succeeds:

grammar Number {
    regex digits { \d+ <.ws> }
    }

my @strings = '137', '137 ', ' 137 ';

for @strings -> $string {
    my $result = Number.parse( $string, :rule<digits> );
    say $result ?? $result !! 'Failed!';  # say shows the Match's .gist
    }

The output isn’t that different, but instead of its success status you see the text that matched:

「137」
「137 」
Failed!

Modify the grammar to remove that dot from <.ws> so it captures whitespace and try again:

grammar Number {
    regex digits { \d+ <ws> }
    }

Now the output shows the nested levels of named captures:

「137」
 ws => 「」
「137 」
 ws => 「 」
Failed!

This still doesn’t match the Str with leading whitespace. The parser couldn’t match that since rule only inserts <.ws> after explicit parts of the pattern. To match leading whitespace you need to add something to the front of the pattern. The beginning-of-string anchor does that, and now there’s something that <.ws> can come after:

grammar Number {
    rule digits { ^ \d+ }    # ^ <.ws> \d+ <.ws>
    }

There’s also the zero-width always-matches token, <?>:

grammar Number {
    rule digits { <?> \d+ }  #  <?> <.ws> \d+ <.ws>
    }

Most of the time you don’t want to play these games. If you want leading whitespace, you can note that explicitly (and you probably don’t want to capture it):

grammar Number {
    rule digits { <.ws> \d+ }  # <.ws> \d+ <.ws>
    }

Use token instead of rule if you don’t want any implicit whitespace:

grammar Number {
    token digits { \d+ }  # just the digits
    }
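A side-by-side comparison may make the three declarators easier to keep straight. This is a sketch under the assumptions already stated in this section; the grammar name Declarators and the rule names as-regex, as-rule, and as-token are made up for the illustration:

```raku
# Declarators and its rule names are hypothetical, chosen for this sketch
grammar Declarators {
    regex as-regex { \d+ }   # no implicit whitespace
    rule  as-rule  { \d+ }   # implicit <.ws> after \d+
    token as-token { \d+ }   # no implicit whitespace
    }

for '137', '137 ', ' 137' -> $string {
    for <as-regex as-rule as-token> -> $name {
        my $ok = Declarators.parse( $string, :rule($name) ).so;
        put "「$string」 with $name ", $ok ?? 'Parsed!' !! 'Failed!';
        }
    }
```

Only the rule version should accept the trailing space, and none of them should accept the leading space.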

You’ll see another feature of rule and token later in this chapter.

EXERCISE 17.1 Write a grammar to match octal digits, with or without a leading 0 or 0o. Your grammar should parse numbers such as 123, 0123, and 0o456, but not 8, 129, or o345.

Multiple Rules

Grammars wouldn’t be useful if you were limited to one rule. You can define additional rules and use them inside other rules. In the first exercise you had only the TOP rule but you could separate the pattern into parts. Break up the pattern in TOP into rules for prefix and digits. It’s this decomposability that makes it so easy to solve hard parsing problems:

grammar OctalNumber {
    regex TOP          { <prefix>? <digits>  }
    regex prefix       {  [ 0o? ]  }
    regex digits       { <[0..7]>+ }
    }

my $number = '0o177';
my $result = OctalNumber.parse( $number );
say $result // "failed";

The stringified Match object shows the overall match and the named subcaptures:

「0o177」
 prefix => 「0o」
 digits => 「177」

You can access the pieces:

put "Prefix: $result<prefix>";
put "Digits: $result<digits>";
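Since the prefix is optional, its capture isn't always there. The following sketch, which copies the grammar under the hypothetical name OctalParts so it runs on its own, uses the defined-or operator to supply a placeholder when the prefix didn't match:

```raku
# OctalParts is a hypothetical copy of OctalNumber for this sketch
grammar OctalParts {
    regex TOP    { <prefix>? <digits> }
    regex prefix { [ 0o? ] }
    regex digits { <[0..7]>+ }
    }

for '0o177', '0177', '177' -> $number {
    my $result = OctalParts.parse( $number );
    # an unmatched optional capture is undefined, so fall back to a label
    my $prefix = $result<prefix> // '(none)';
    put "number=$number prefix=$prefix digits=$result<digits>";
    }
```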

EXERCISE 17.2 Create a grammar to match a Raku variable name with a sigil (ignore sigilless variables, because that’s too easy). Use separate rules to match the sigil and the identifier. Here is a list of candidates to check if you don’t come up with your own: my @candidates = qw/ sigilless $scalar @array %hash $123abc $abc'123 $ab'c123 $two-words $two- $-dash /;

You can suppress some of those named captures by prefixing the rule with a dot. You probably don’t care about the prefix, so don’t save it:

grammar OctalNumber {
    regex TOP          { <.prefix>? <digits> }
    regex prefix       {  [ 0o? ]  }
    regex digits       { <[0..7]>+ }
    }

my $number = '0o177';
my $result = OctalNumber.parse( $number );
say $result // "failed";

The output doesn’t include the prefix information:

「0o177」
 digits => 「177」

This doesn’t make much of a difference in this small example, but imagine a complicated grammar with many, many rules. That brings you to the next big feature of grammars. Besides the grammar itself, you can specify an action class that processes the rules as the grammar successfully parses them.

Debugging Grammars

There are two modules that can help you figure out what’s going on in your grammar. Both are much more impressive in your terminal.

Grammar::Tracer

The Grammar::Tracer module shows you the path through a grammar (and applies to any grammar in its scope). Merely loading the module is enough to activate it:

use Grammar::Tracer;

grammar OctalNumber {
    regex TOP          { <prefix>? <digits>  }
    regex prefix       {  [ 0o? ]  }
    regex digits       { <[0..7]>+ }
    }

my $number = '0o177';
$/ = OctalNumber.parse( $number );
say $/ // "failed";

The first part of the output is the trace. It shows which rule it’s in and the result. In this example each one matches:

TOP
|  prefix
|  * MATCH "0o"
|  digits
|  * MATCH "177"
* MATCH "0o177"
「0o177」
 prefix => 「0o」
 digits => 「177」

Changing the data to include invalid digits, such as 0o178, means the grammar will fail. In the trace you can see it matches up to 0o17 but can’t continue, so you know where in your Str things went wrong. It could be that the grammar should not match the text or the grammar is not as accommodating as it should be:

TOP
|  prefix
|  * MATCH "0o"
|  digits
|  * MATCH "17"
* MATCH "0o17"
digits
* FAIL
digits
* MATCH "0"
failed

Instead of adding Grammar::Tracer to your program you can load it from the command line with the -M switch. You probably don’t mean to leave it in anyway:

% raku -MGrammar::Tracer program.p6

Grammar::Debugger

The Grammar::Debugger module does the same thing as Grammar::Tracer (they come together in the same distribution) but allows you to proceed one step at a time. When you start it you get a prompt; type h to get a list of commands:

% raku -MGrammar::Debugger test.p6
TOP
> h
    r              run (until breakpoint, if any)
    <enter>        single step
    rf             run until a match fails
    r <name>       run until rule <name> is reached
    bp add <name>  add a rule name breakpoint
    bp list        list all active rule name breakpoints
    bp rm <name>   remove a rule name breakpoint
    bp rm          removes all breakpoints
    q              quit

Typing Enter with no command single-steps through the parse process and gives you a chance to inspect the text and the state of the parser. The rf command will get you to the next failing rule:

> rf
|  prefix
|  * MATCH "0o"
|  digits
|  * MATCH "17"
* MATCH "0o17"
digits
* FAIL
>

A Simple Action Class

A grammar does its work by descending into its rules to take apart text. You can go the opposite way by processing each part of the parsed text to build a new Str (or data structure, or whatever you like). You can tell .parse to use an action class to do this.

Here’s a simple action class, OctalActions. It doesn’t need to have the same name as the grammar, but the method names are the same as the rule names. Each method takes a Match object argument. In this example, the signature uses $/, which is a variable with a few advantages that you’ll see in a moment:

class OctalActions {
    method digits ($/) { put "Action class got $/" }
    }

grammar OctalNumber {
    regex TOP          { <.prefix>? <digits>  }
    regex prefix       {  [ 0o? ]  }
    regex digits       { <[0..7]>+ }
    }

Tell .parse which class to use with the :actions named parameter. The name does not need to correspond to the grammar:

my $number = '0o177';
my $result = OctalNumber.parse(
    $number, :actions(OctalActions)
    );
say $result // "failed";

This action class doesn’t do much. When the digits rule successfully matches it triggers the method of the same name in the action class. That method merely outputs the argument:

Action class got 177
「0o177」
 digits => 「177」

EXERCISE 17.3 Implement your own action class for the OctalNumber grammar. When the digits method matches, output the decimal version of the number. The parse-base routine from Str may be useful. For extra credit, take one number per line from standard input and turn them into decimal numbers.

Creating an Abstract Syntax Tree

Actions shouldn’t output information directly. Instead, they can add values to the Match object. Calling make in the action method sets a value in the abstract syntax tree (or .ast) slot of the Match. You can access that with .made:

class OctalActions {
    method digits ($/) {
        make parse-base( ~$/, 8 ) # must stringify $/
        }
    }

grammar OctalNumber {
    regex TOP          { <.prefix>? <digits>  }
    regex prefix       {  [ 0o? ]  }
    regex digits       { <[0..7]>+ }
    }

my $number = '0o177';
my $result = OctalNumber.parse(
    $number, :actions(OctalActions)
    );
put $result ??
    "Turned 「{$result<digits>}」 into 「{$result<digits>.made}」"
    !! 'Failed!';

The make puts something into the .ast slot of the Match and .made gets it back out. You can make any value that you like, including containers, objects, and most other things you can imagine. You still get the original, literal match.
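As a small illustration of making something other than a plain number, this sketch stores a Hash in the .made slot. The names OctalWithInfo and OctalInfo, and the keys text and decimal, are invented for the example:

```raku
# OctalWithInfo, OctalInfo, and the hash keys are hypothetical names
grammar OctalWithInfo {
    regex TOP    { <.prefix>? <digits> }
    regex prefix { [ 0o? ] }
    regex digits { <[0..7]>+ }
    }

class OctalInfo {
    method digits ($/) {
        # make a Hash that keeps the original text beside the converted value
        make %( text => ~$/, decimal => parse-base( ~$/, 8 ) );
        }
    }

my $result = OctalWithInfo.parse( '0o177', :actions(OctalInfo) );
if $result {
    my $info = $result<digits>.made;
    put "Octal $info<text> is decimal $info<decimal>";
    }
```

Whatever you make is yours to shape; the Match still holds the literal text in case you need it later.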

In the previous example, the digits action method handled the value. A TOP action method could do it, but it has to reach one level below the Match object:

class OctalActions {
    method TOP ($/) {
        make parse-base( ~$<digits>, 8 ) # must stringify $<digits>
        }
    }

grammar OctalNumber {
    regex TOP          { <.prefix>? <digits>  }
    regex prefix       {  [ 0o? ]  }
    regex digits       { <[0..7]>+ }
    }

my $number = '0o177';
my $result = OctalNumber.parse(
    $number, :actions(OctalActions)
    );
put $result.so ??
    "Turned 「{$number}」 into 「{$result.made}」"
    !! 'Failed!';

You don’t have to use $/ in the signature; it’s a convenience. The make routine sets its value on the $/ it finds in scope, so if you give the parameter another name you call .make on that object yourself. You could use some other variable if you are paid by the character:

class OctalActions {
    method TOP ($match) { $match.make: parse-base( ~$match<digits>, 8 ) }
    }

EXERCISE 17.4 Create a grammar to parse a four-part, dotted-decimal IP address, such as 192.168.1.137. Create an action class that turns the parse results into a 32-bit number. Output that 32-bit number in hexadecimal.

Ratcheting

The rule and token declarators have a feature that regex doesn’t; they both prevent backtracking by implicitly setting the :ratchet adverb. Once one of those rules matches they don’t backtrack to try again if there’s a failure later in the grammar.

Here’s a nonsense grammar that includes a rule <some-stuff> that matches one or more of any character. The TOP token wants to match digits surrounded by unspecified stuff:

grammar Stuff {
    token TOP { <some-stuff> <digits> <some-stuff> }
    token digits       { \d+ }
    token some-stuff   { .+  }
    }

This Str could satisfy that pattern. It has stuff, some digits, and more stuff:

my $string = 'abcdef123xyx456';

But, Stuff fails to parse it:

my $result = Stuff.parse( $string );
put "「$string」 ", $result ?? 'Parsed!' !! 'Failed!'; # Failed!

It’s the :ratchet that makes it fail. Work out its path to see why. TOP has to first match <some-stuff>. That matches any character one or more times, greedily, so it matches the entire text. TOP next needs to match <digits>, but there is nothing left to match because of that greediness. Without :ratchet the pattern might roll back some of the characters it already consumed. With :ratchet it doesn’t do that. The grammar can’t match the rest of TOP and it fails.

Without :ratchet the situation is different. If you use regex instead of token, you allow the grammar to give back characters it has already matched:

grammar Stuff {
    # regex does not turn on ratcheting
    regex TOP { <some-stuff> <digits> <some-stuff> }
    token digits       { \d+ }
    regex some-stuff   { .+  }
    }

That could match. The TOP matches <some-stuff> but realizes it’s run out of text and starts backtracking. All parts of the grammar that want to allow backtracking have to use regex. It’s not good enough for TOP to backtrack but not <some-stuff>.
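Putting the two versions next to each other makes the difference easy to try. In this sketch the grammar names Ratcheted and Backtracking are made up; the patterns are the ones from this section:

```raku
# Ratcheted and Backtracking are hypothetical names for the two variants
grammar Ratcheted {
    token TOP        { <some-stuff> <digits> <some-stuff> }
    token digits     { \d+ }
    token some-stuff { .+  }
    }

grammar Backtracking {
    regex TOP        { <some-stuff> <digits> <some-stuff> }
    token digits     { \d+ }
    regex some-stuff { .+  }
    }

my $string = 'abcdef123xyx456';

for Ratcheted, Backtracking -> $grammar {
    my $result = $grammar.parse( $string );
    put $grammar.^name, ': ',
        $result ?? "digits=$result<digits>" !! 'Failed!';
    }
```

Which digits end up in the capture depends on how far the first <some-stuff> had to back off, so don't be surprised if it isn't the run of digits you had in mind.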

Parsing JSON

In Mastering Perl I presented a JSON parser that Randal Schwartz created using some advanced features of Perl 5 regular expressions. In many ways his implementation was a grammar, but he was forced to inseparably combine the parsing and the actions. That made the regular expression almost impenetrable. It’s much cleaner and more accessible to write it as a Raku grammar.

JSON is actually quite simple with only a few weird things to handle, but it gives you the opportunity to see how proto rules can simplify actions:

grammar Grammar::JSON {
    rule TOP                { <.ws> <value> <.ws> }

    rule object             { '{' ~ '}' <string-value-list> }
    rule string-value-list  { <string-value> * % ',' }
    token string-value      { <string> <.ws> ':' <.ws> <value> }

    rule array              { '[' ~ ']' <list> }
    rule list               { <value> * % ',' }

    token value             {
        <string> | <number> | <object> | <array> |
        <true> | <false> | <null>
        }

    token true  { 'true'  }
    token false { 'false' }
    token null  { 'null'  }

    token string {
        (:ignoremark \" ) ~ \"
        [
            <u_char>              |
            [ '\\' <[\\/bfnrt"]> ] |
            <-[\\\"\n\t]>+
        ]*
        }

    token u_char {
        '\\u' <code_point>
        }

    token code_point { <[0..9a..fA..F]>**4 }

    token number {
        '-' ?
        [ 0 | <[1..9]><[0..9]>* ]
        [ '.' <[0..9]>+ ]?
        [ <[eE]> <[+-]>? <[0..9]>+ ]?
        }
    }

You may be surprised at how easy and short that grammar is. It’s almost a straight translation of the grammar from RFC 8259. Now, create an action class for that:

class JSON::Actions {
    method TOP ($/) { make $<value>.made }
    method object ($/) {
        make $<string-value-list>.made.hash.item;
        }
    method array ($/) {
        make $<list>.made.item;
        }

    method true       ($/) { make True }
    method false      ($/) { make False }
    method null       ($/) { make Nil }

    method value      ($/) { make (
        $<true> || $<false> || $<null> || $<object> ||
        $<array> || $<string> || $<number> ).made
        }

    method string-value-list ($/) {
        make $<string-value>>>.made.flat;
        }

    method string-value ($/) {
        make $<string>.made => $<value>.made
        }

    method list       ($/) { make ~$/ }
    method string     ($/) { make $<uchar>.made || ~$/ }

    method u_char     ($/) { make $<code_point>.made }
    method code_point ($/) { make chr( (~$/).parse-base(16) ) }
    method number     ($/) { make +$/ }
    }

Look at the clunky handling of value. Almost anything can be a value, so the action method does some ham-handed work to figure out which thing just matched. It looks into the possible submatches to find one with a defined value. Well, that’s pretty stupid even if it’s a quick way to get started (although there is some value in the immediate stupid versus the far-off smart).

A proto rule gets around this by making it easy for you to give different subrules the same name but different patterns. Instead of an alternation you have one token for each:

proto token value { * }
token value:sym<string> { <string> }
token value:sym<number> { <number> }
token value:sym<object> { <object> }
token value:sym<array>  { <array>  }
token value:sym<true>   { <sym>    }
token value:sym<false>  { <sym>    }
token value:sym<null>   { <sym>    }

The first proto rule matches *, which really means it dispatches to another rule in that group. It can dispatch to all of them and find the one that works.

Some of these use the special <sym> subrule in their pattern. This means that the name of the rule is the literal text to match. The proto rule <true> matches the literal text true. You don’t have to type that out in the name and the pattern.

It doesn’t matter which of those matches; the grammar calls each of them $<value>. The superrule only knows that something that is a value matched and that the subrule handled it appropriately. The action class makes the right value and stores it in the Match:

class JSON::Actions {
    method TOP    ($/) { make $<value>.made }
    method object ($/) { make $<string-value-list>.made.hash.item }

    method string-value-list ($/) { make $<string-value>>>.made.flat }
    method string-value      ($/) {
        make $<string>.made => $<value>.made
        }

    method array  ($/) { make $<list>.made.item }
    method list   ($/) { make [ $<value>.map: *.made ] }

    method string     ($/) { make $<uchar>.made || ~$/ }

    method value:sym<number> ($/) { make +$/.Str }
    method value:sym<string> ($/) { make $<string>.made }
    method value:sym<true>   ($/) { make Bool::True  }
    method value:sym<false>  ($/) { make Bool::False }
    method value:sym<null>   ($/) { make Any }
    method value:sym<object> ($/) { make $<object>.made }
    method value:sym<array>  ($/) { make $<array>.made }

    method u_char     ($/) { make $<code_point>.made }
    method code_point ($/) { make chr( (~$/).parse-base(16) ) }
    }
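If the full JSON grammar is too much to take in at once, a cut-down version shows the proto dispatch by itself. This sketch handles only the three literal values; Literal and LiteralActions are hypothetical names:

```raku
# Literal and LiteralActions are hypothetical names for this sketch
grammar Literal {
    token TOP { <value> }

    proto token value { * }
    token value:sym<true>  { <sym> }
    token value:sym<false> { <sym> }
    token value:sym<null>  { <sym> }
    }

class LiteralActions {
    method TOP ($/) { make $<value>.made }

    # one action method per variant, named the same way as the token
    method value:sym<true>  ($/) { make True  }
    method value:sym<false> ($/) { make False }
    method value:sym<null>  ($/) { make Any   }
    }

for <true false null> -> $text {
    my $result = Literal.parse( $text, :actions(LiteralActions) );
    say $text, ' => ', $result.made.raku;
    }
```

The TOP method never asks which variant matched; it takes whatever $<value> made.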

EXERCISE 17.5 Implement your own JSON parser (steal all the code you like). Test it against some JSON files to see how well it works. You might like to try the JSON files at https://github.com/briandfoy/json-acceptance-tests.

Parsing CSV

Let’s parse some comma-separated values (CSV) files. These are tricky because there’s no actual standard (despite RFC 4180). Microsoft Excel does it one way but some other producers do it slightly differently.

People often initially go wrong thinking they can merely split the data on a comma character—but that might be part of the literal data in a quoted field. The quote character may also be part of the literal data, but one producer might escape internal quote marks by doubling them, "", while another might use the backslash, \". People often assume they are line-oriented, but some producers allow unescaped (but quoted!) vertical whitespace. If all of that wasn’t bad enough, what do you do if one line has fewer (or more) fields than the other lines?

WARNING

Don’t parse CSV files like this. The Text::CSV module not only parses the format but also tries to correct problems as it goes.

Still willing to give it a try? You should find that grammars make most of these concerns tractable:

  • The ratcheting behavior keeps things simple.
  • You can easily handle balanced openers and closers (i.e., the quoting stuff).
  • A grammar can inherit other grammars, so you can adjust a grammar based on the data instead of writing one grammar that handles all the data.
  • You’ve seen action classes, but you can also have action instances that remember extra non-Match data.
  • There’s a .subparse method that lets you parse chunks so you can handle one record at a time.

Here’s a simple CSV grammar based off the rules in RFC 4180. It allows for quoted fields and uses "" to escape a literal quote. If a comma, quote, or vertical whitespace appears in the literal data, it must be quoted:

grammar Grammar::CSV {
    token TOP       { <record>+ }
    token record    { <value>+ % <.separator> \R }
    token separator { <.ws> ',' <.ws> }
    token value     {
        '"'             # quoted
            <( [ <-["]> | <.escaped-quote> ]* )>
        '"'
            |
        <-[",\n\f\r]>+  # non-quoted (no vertical ws)
            |
            ''          # empty
        }

    token escaped-quote { '""' }
    }

class CSV::Actions {
    method record ($/) { make $<value>».made.flat }
    method value ($/)  {
        # undo the double double quote
        make $/.subst( rx/ '""' /, '"', :g )
        }
    }

Try this on entire files. The entire file either satisfies this grammar or doesn’t:

my $data = $filename.IO.slurp;
my $result = Grammar::CSV.parse( $data );
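Because :actions takes an object as well as a class, an action instance can remember things that live outside the Match while the text is parsed. This is a minimal sketch with inline data; TinyCSV and CountingActions are hypothetical names, and the grammar is far simpler than Grammar::CSV (no quoting, no empty fields):

```raku
# TinyCSV and CountingActions are hypothetical, simplified for this sketch
grammar TinyCSV {
    token TOP    { <record>+ }
    token record { <value>+ % ',' \n }
    token value  { <-[",\n]>+ }
    }

class CountingActions {
    has Int $.records = 0;      # extra state that is not part of any Match

    method record ($/) {
        $!records++;
        make $<value>».Str;
        }
    }

my $actions = CountingActions.new;
my $result  = TinyCSV.parse( "a,b,c\n1,2,3\n", :actions($actions) );

put "Parsed { $actions.records } records" if $result;
```

Each call to .parse with the same instance keeps adding to the count, so make a fresh object when you want to start over.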

You typically don’t want to parse entire files, though. Let’s fix the first part of that problem. You want to process records as you run into them. Instead of using .parse, which anchors to the end of the text, you can use .subparse, which doesn’t. This means you can parse part of the text then stop.

You can deal with one record at a time. Using .subparse with the record rule gets you the first record and only the first record. The .subparse method always returns a Match, unlike .parse, which only returns a Match when it succeeds. You can’t rely on the type of the object as an indication of success:

my $data = $filename.IO.slurp;
my $first-result = Grammar::CSV.subparse(
    $data, :rule('record'), :actions(CSV::Actions)
    );
if $first-result { ... }

That works for the first line. Use :c(N) to tell these methods where to start in the Str. You have to know where you want to start. The Match knows how far it got; look in the .to slot:

my $data  = $filename.IO.slurp;

loop {
    state $from = 0;
    my $match = Grammar::CSV.subparse(
        $data,
        :rule('record'),
        :actions(CSV::Actions),
        :c($from)
        );
    last unless $match;

    put "Matched from {$match.from} to {$match.to}";
    $from = $match.to;
    say $match;
    }

This is most of the way to a solution—it fails to go through the entire file if .subparse fails on one record. With some boring monkey work you could fix this to find the start of the next record and restart the parsing, but that’s more than I want to fit in this book.

Adjusting the Grammar

You thought the problem was solved. Then, someone sent you a file with a slightly different format. Instead of escaping a " by doubling it, the new format uses the backslash.

Now your existing grammar fails to parse. You don’t have a rule that satisfies that type of escape because you didn’t need it for your grammar. As a matter of practice in both patterns and grammars, only match what you should match. Be liberal in what you accept in other ways, such as making a subgrammar to handle the new case:

grammar Grammar::CSV::Backslashed is Grammar::CSV {
    token escaped-quote { '\\"' }
    }

class CSV::Actions::Backslashed is CSV::Actions {
    method value ($/)  { make $/.subst( rx/ '\\"' /, '"', :g ) }
    }
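You can try the override idea without any files. This sketch shrinks the problem to a single quoted field; QuotedField and QuotedField::Backslashed are hypothetical names, and the character class excludes the backslash so the two dialects don't overlap:

```raku
# QuotedField and its subgrammar are hypothetical, single-field versions
grammar QuotedField {
    token TOP           { '"' <( [ <.escaped-quote> | <-["\\]> ]* )> '"' }
    token escaped-quote { '""' }
    }

grammar QuotedField::Backslashed is QuotedField {
    token escaped-quote { '\\"' }   # only this rule changes
    }

my $doubled     = '"say ""hi"""';
my $backslashed = '"say \"hi\""';

put 'doubled:     ', QuotedField.parse( $doubled ) // 'Failed!';
put 'backslashed: ', QuotedField::Backslashed.parse( $backslashed ) // 'Failed!';
put 'mismatched:  ', QuotedField.parse( $backslashed ) // 'Failed!';
```

The subgrammar inherits TOP untouched; the only thing it replaces is the rule that differs between the two formats.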

With two grammars, how do you get the one that you need to use? The name interpolation ::($name) comes in handy here:

my %formats;
%formats<doubled> = {
    'file'    => $*SPEC.catfile( <corpus test.csv> ),
    'grammar' => 'Grammar::CSV',
    };
%formats<backslashed> = {
    'file' => $*SPEC.catfile( <corpus test-backslash.csv> ),
    'grammar' => 'Grammar::CSV::Backslashed',
    };

for %formats.values -> $hash {
    $hash<data> = $hash<file>.IO.slurp;
    my $class = (require ::( $hash<grammar> ) );
    my $match = $class.parse( $hash<data> );
    say "{$hash<file>} with {$hash<grammar>} ",
        $match ?? 'parsed' !! 'failed';
    }

The %formats Hash of Hashes stores the filenames and the grammars for them. You can load a grammar and use it to parse the data without the explicit grammar name:

corpus/test.csv with Grammar::CSV parsed
corpus/test-backslash.csv with Grammar::CSV::Backslashed parsed

That mostly solves the problem, although there are plenty of special cases that this doesn’t cover.
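The name interpolation works for any package the program already knows about, so you can experiment with it apart from files and require. In this sketch the grammars Dialect::Plain and Dialect::Signed are invented stand-ins for the two CSV grammars:

```raku
# Dialect::Plain and Dialect::Signed are hypothetical stand-in grammars
grammar Dialect::Plain  { token TOP { \d+ } }
grammar Dialect::Signed { token TOP { <[+-]>? \d+ } }

for <Dialect::Plain Dialect::Signed> -> $name {
    my $grammar = ::($name);    # look up the grammar from its name in a Str
    put $name, ' ', $grammar.parse( '-42' ) ?? 'parsed' !! 'failed';
    }
```

The Str could just as easily come from a configuration file or a command-line argument.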

Using Roles in Grammars

Roles can supply rules and methods that grammars can use. In the previous section you handled different sorts of double-quote escaping through inheritance, where you overrode the rule. You can do the same thing with roles.

A grammar can have methods and subroutines. The way you declare a name with sub, method, or rule tells the language parser (not your grammar!) how to parse the stuff in the Block.

First, adjust the main grammar to have a stub method for <escaped-quote>. This forces something else to define it:

grammar Grammar::CSV {
    token TOP       { <record>+ }
    token record    { <value>+ % <.separator> \R }
    token separator { <.ws> ',' <.ws> }
    token value     {
        '"'             # quoted
            <( [ <-["]> | <.escaped-quote> ]* )>
        '"'
            |
        <-[",\n\f\r]>+  # non-quoted (no vertical ws)
            |
            ''          # empty
        }

    # stub that you must define in a role
    method escaped-quote { !!! }
    }

A role will fill in that stub method. There’s one role for each way to escape the double quote:

role DoubledQuote     { token escaped-quote { '""'  } }
role BackslashedQuote { token escaped-quote { '\\"' } }

When it’s time to parse a file you can choose which role you want to use. You can create a new object for Grammar::CSV and apply the appropriate role to it:

my $filename   = ...;
my $csv-data   = $filename.IO.slurp;
my $csv-parser = Grammar::CSV.new but DoubledQuote;

Use that object to parse your data:

my $match = $csv-parser.parse: $csv-data;
say $match // 'Failed!';

Doing this doesn’t fix the double quotes in the data—a "" stays as a ""—but you can fix that in an action class.

EXERCISE 17.6 Adjust the CSV example to use roles instead of inheritance. Create an action class to adjust the escaped double quotes as you run into them. You can start with Grammars/test.csv from the downloads section of the book’s website if you like.

Summary

Grammars are one of the killer features of the language. You can define complex relationships between patterns and use action classes to run arbitrarily complex code when something matches. You might find that your entire program ends up being one big grammar.
