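5 Ways to Extract Text Blocks

Suppose we have a piece of text in which =begin code and =end code markers split the text into sections, and we want to extract the content between each pair of markers. Grammar to the rescue!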

my $excerpt = q:to/END/;
Here's some unimportant text.
=begin code
This code block is what we're after.
We'll use 'ff' to get it.
=end code
More unimportant text.
=begin code
I want this line.
and this line as well.
HaHa
=end code
More unimport text.
=begin code
Let's to go home.
=end code
END

Grammar

#use Grammar::Tracer;
#use Grammar::Debugger;

grammar ExtractSection {
    rule TOP { ^ <section>+ %% <.comment> $ }
    token section { <line>+ % <.ws> }
    # A line is any non-empty line that is not a marker line.
    token line { <!before <comment>> \N+ \n }
    token comment { ['=begin code' | '=end code' ] \n }
}

class ExtractSectionAction {
    method TOP($/) { make $/.values».ast }
    method section($/) { make ~$/.trim }
    method line($/) { make ~$/.trim }
    method comment($/) { make Empty }
}

my $em = ExtractSection.parse($excerpt, :actions(ExtractSectionAction)).ast;

for @$em -> $section {
    say $section;
    say '-' x 35;
}

Output:

Here's some unimportant text.
-----------------------------------
This code block is what we're after.
We'll use 'ff' to get it.
-----------------------------------
More unimportant text.
-----------------------------------
I want this line.
and this line as well.
HaHa
-----------------------------------
More unimport text.
-----------------------------------
Let's to go home.
-----------------------------------

However, this also pulls in the unrelated lines outside the markers. Brad Gilbert suggested writing it like this instead:

#use Grammar::Tracer;
#use Grammar::Debugger;

grammar ExtractSection {
    token start { ^^ '=begin code' \n }
    token finish { ^^ '=end code' \n }
    # )> ends the captured text before the newline: \n is matched but not kept.
    token line { ^^ \N+ )> \n }
    # A ~ B C matches A, then C, then B: start marker, lines, finish marker.
    token section { <start> ~ <finish> <line>+? }
    token comment { ^^ \N+ \n }
    token TOP { [<section> || <comment>]+ }
}

class ExtractSectionAction {
    method TOP($/) { make @<section>».ast.List }
    method section($/) { make ~«@<line>.List }
    method line($/) { make ~$/.trim }
    method comment($/) { make Empty }
}

my $em = ExtractSection.parse($excerpt, :actions(ExtractSectionAction)).ast;

for @$em -> $section {
    say $section.perl;
    say '-' x 35;
}

Output:

$("This code block is what we're after.", "We'll use 'ff' to get it.")
-----------------------------------
$("I want this line.", "and this line as well.", "HaHa")
-----------------------------------
$("Let's to go home.",)
-----------------------------------

Now we can iterate over each section and do whatever we need with it. The highlight of this version is its use of ~ in the section token. Excellent!
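To see what the ~ in the section token does on its own, here is a minimal sketch that is separate from the grammar above. The input string and the capture name inner are made up for illustration. Inside a regex, A ~ B C matches A, then C, then B, which is why the section token reads as "start marker, some lines, finish marker":

```raku
# '(' ~ ')' X  matches: '(' first, then X, then ')'.
my $m = '(abc)' ~~ / '(' ~ ')' $<inner>=[ \w+ ] /;

say ~$m;          # (abc)
say ~$m<inner>;   # abc
```

Writing the two delimiters next to each other keeps the opening and closing markers visually paired, even when the part between them is long.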

rotor

Since the text is structured, it is still structured once it is stored in an array, so we can also do the job with rotor:

# Keep the lines from each `=begin code` through its `=end code`, markers included.
my @sections =
    gather for $excerpt.lines -> $line {
        if $line ~~ /'=begin code'/ ff $line ~~ /'=end code'/ {
            take $line.trim;
        }
    }

# Gather the indices of every `=begin code` and `=end code` line.
my @idx =
    gather for @sections.kv -> $k, $v {
        if $v ~~ /'=begin code'/ or $v ~~ /'=end code'/ {
            take $k;
        }
    }

# Gather every line except the `=begin code` and `=end code` lines.
my @r =
    gather for @sections.kv -> $k, $v {
        if $v !~~ /'=begin code' | '=end code'/ {
            take $v;
        }
    }

# Content lines per section: the index span of each marker pair, minus the 2 markers.
my @counts = @idx.rotor(2)».minmax».elems »-» 2;
say @r.rotor(|@counts).perl;

Output:

(("This code block is what we're after.", "We'll use 'ff' to get it."), ("I want this line.", "and this line as well.", "HaHa"), ("Let's to go home.",)).Seq

Also excellent!

Iteration
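The rotor approach rests on two small steps that are easier to see in isolation. The sketch below uses a made-up six-element list and a made-up index pair (0, 3), not the real data: rotor with several sizes cuts a list into consecutive batches of those sizes, and minmax on a pair of marker indices gives the range that spans them.

```raku
# rotor with several sizes: batches of 2, then 3, then 1 elements.
say <a b c d e f>.rotor(2, 3, 1);   # ((a b) (c d e) (f))

# A marker pair at indices 0 and 3 spans the range 0..3 (4 elements);
# subtracting the 2 marker lines leaves 2 content lines.
my $count = (0, 3).minmax.elems - 2;
say $count;                         # 2
```

One thing to keep in mind is that rotor cycles through its sizes if the list is longer than their sum, so the counts need to add up to the number of content lines, as they do in the code above.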

Another approach, copied from reddit, uses an iterator. I didn't fully understand it, but it looks excellent too!

# Collect lines from the shared iterator until `=end code` or the end of input.
sub doSomething(Iterator $iter) {
    my @lines;
    my $item := $iter.pull-one;
    until ($item =:= IterationEnd || $item.Str ~~ / '=end code' /) {
        @lines.push($item);
        $item := $iter.pull-one;
    }
    say "Got @lines[]";
}

my Iterator $iter = $excerpt.lines.iterator;
my $item := $iter.pull-one;
until ($item =:= IterationEnd) {
    # On a start marker, hand the iterator over; the loop resumes after `=end code`.
    if ($item.Str ~~ / '=begin code' /) {
        doSomething($iter);
    }
    $item := $iter.pull-one;
}
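
The core of the iterator version is the pull-one / IterationEnd protocol, which may be easier to follow on a tiny made-up list. This is only a sketch of that protocol, with the hypothetical array @seen standing in for whatever you do with each item: pull-one returns the next item each time it is called, and returns the sentinel IterationEnd once nothing is left.

```raku
# Pull items one at a time until the IterationEnd sentinel appears.
my $iter = <a b c>.iterator;
my @seen;
loop {
    my $item := $iter.pull-one;
    last if $item =:= IterationEnd;   # compare identity with =:=, not equality
    @seen.push($item);
}
say @seen;   # [a b c]
```

Because the iterator remembers its position, passing it to another routine (as doSomething does above) lets that routine consume lines and then hand control back exactly where it stopped.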

comb

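When matching against a multi-line string, use ^^ and $$ to anchor the start and end of a line. Whatever comes before <( takes part in the match but is not captured in the Match object, and whatever comes after )> takes part in the match but is not captured either.
This ensures that the regex passed to comb picks out only the lines we are interested in: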

for $excerpt.comb(/^^ '=begin code' $$ \s* <( .+? )> \s+ ^^ '=end code' $$/) -> $c {
    say $c;
    say '-' x 15;
}

Output:

This code block is what we're after.
We'll use 'ff' to get it.
---------------
I want this line.
and this line as well.
HaHa
---------------
Let's to go home.
---------------
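
The capture markers <( and )> can be tried out on a much smaller input. The strings below are made up for illustration. The text outside the markers still has to match, but only the part between them ends up in the result:

```raku
# 'key=' and ';' must match, but only the text between <( and )> is kept.
my $m = 'key=value;' ~~ / 'key=' <( \w+ )> ';' /;
say ~$m;       # value
say $m.from;   # 4  (the match starts after 'key=')

# With comb, each returned string is just the marked part of each match.
say 'a1 b2 c3'.comb(/ <[a..z]> <( \d )> /);   # (1 2 3)
```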

Reference: https://stackoverflow.com/questions/49280568/the-use-of-flip-flop-operator-in-perl-6/