5 Ways to Extract Text Blocks
焉知非鱼
Suppose a piece of text is divided into sections by =begin code and =end code lines, and I want to extract the content inside each section. Grammar to the rescue!
my $excerpt = q:to/END/;
Here's some unimportant text.
=begin code
This code block is what we're after.
We'll use 'ff' to get it.
=end code
More unimportant text.
=begin code
I want this line.
and this line as well.
HaHa
=end code
More unimport text.
=begin code
Let's to go home.
=end code
END
Grammar
#use Grammar::Tracer;
#use Grammar::Debugger;
grammar ExtractSection {
    rule  TOP     { ^ <section>+ %% <.comment> $ }
    token section { <line>+ % <.ws> }
    token line    { <?!before <comment>> \N+ \n }
    token comment { ['=begin code' | '=end code' ] \n }
}

class ExtractSectionAction {
    method TOP($/)     { make $/.values».ast }
    method section($/) { make ~$/.trim }
    method line($/)    { make ~$/.trim }
    method comment($/) { make Empty }
}

my $em = ExtractSection.parse($excerpt, :actions(ExtractSectionAction)).ast;
for @$em -> $line {
    say $line;
    say '-' x 35;
}
Output:
Here's some unimportant text.
-----------------------------------
This code block is what we're after.
We'll use 'ff' to get it.
-----------------------------------
More unimportant text.
-----------------------------------
I want this line.
and this line as well.
HaHa
-----------------------------------
More unimport text.
-----------------------------------
Let's to go home.
-----------------------------------
But this also pulls in the unrelated lines outside the code sections. Brad Gilbert suggested writing it like this instead:
#use Grammar::Tracer;
#use Grammar::Debugger;
grammar ExtractSection {
    token start   { ^^ '=begin code' \n }
    token finish  { ^^ '=end code' \n }
    token line    { ^^ <( \N+ )> \n }
    token section { <start> ~ <finish> <line>+? }
    token comment { ^^ \N+ \n }
    token TOP     { [<section> || <comment>]+ }
}

class ExtractSectionAction {
    method TOP($/)     { make @<section>».ast.List }
    method section($/) { make ~«@<line>.List }
    method line($/)    { make ~$/.trim }
    method comment($/) { make Empty }
}

my $em = ExtractSection.parse($excerpt, :actions(ExtractSectionAction)).ast;
for @$em -> $line {
    say $line.perl;
    say '-' x 35;
}
Output:
$("This code block is what we're after.", "We'll use 'ff' to get it.")
-----------------------------------
$("I want this line.", "and this line as well.", "HaHa")
-----------------------------------
$("Let's to go home.",)
-----------------------------------
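Now you can iterate over each section and do whatever you need with it. The standout feature here is the use of ~, which lets the section token name its opening and closing delimiters (<start> and <finish>) side by side, ahead of the <line>+? that actually sits between them. Excellent!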
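To make the role of ~ more concrete, here is a minimal sketch outside of any grammar. In a Raku regex, A ~ B C matches A, then C, then B. The sample string and the variable names are invented for illustration and are not part of the original answer.

```raku
# Sample input, invented for this illustration.
my $text = 'call(alpha) then call(beta)';

# '(' ~ ')' (\w+) matches '(', then the captured word, then ')'.
my @args = $text.match(/ '(' ~ ')' (\w+) /, :g).map({ $_[0].Str });

say @args;  # [alpha beta]
```

The section token above has the same shape, with <start> and <finish> in place of the parentheses and <line>+? in place of the captured word.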
rotor
Since the text is structured, it stays structured once it is stored in an array, so rotor can do the job as well:
# keep only the lines from each `=begin code` through its `=end code`, markers included
my @sections =
    gather for $excerpt.lines -> $line {
        if $line ~~ /'=begin code'/ ff $line ~~ /'=end code'/ {
            take $line.trim;
        }
    }

my @idx = # collect the indices of every `=begin code` and `=end code` line
    gather for @sections.kv -> $k, $v {
        if $v ~~ /'=begin code'/ or $v ~~ /'=end code'/ {
            take $k;
        }
    }

my @r = # collect every line except the `=begin code` and `=end code` markers
    gather for @sections.kv -> $k, $v {
        if $v !~~ /'=begin code' | '=end code'/ {
            take $v;
        }
    }

# each pair of marker indices spans one section; subtract the 2 marker lines to get its size
my @counts = @idx.rotor(2)».minmax».elems »-» 2;
say @r.rotor(|@counts).perl;
Output:
(("This code block is what we're after.", "We'll use 'ff' to get it."), ("I want this line.", "and this line as well.", "HaHa"), ("Let's to go home.",)).Seq
Also excellent!
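Two building blocks carry the rotor approach: the flip-flop operator ff, which stays true from the line matching its left side through the line matching its right side, and rotor called with several batch sizes. The sketch below shows each in isolation on made-up data. The ^ff^ variant, which leaves out both boundary lines, is included only for comparison; it is not what the code above uses.

```raku
# Sample lines, invented for this illustration.
my @lines = 'intro', '=begin code', 'a', 'b', '=end code', 'outro';

# ff is true from the start marker through the end marker, inclusive.
my @with-markers = gather for @lines -> $line {
    take $line if $line eq '=begin code' ff $line eq '=end code';
}

# ^ff^ excludes both marker lines.
my @without-markers = gather for @lines -> $line {
    take $line if $line eq '=begin code' ^ff^ $line eq '=end code';
}

say @with-markers.elems;    # 4
say @without-markers;       # [a b]

# rotor with several sizes cuts consecutive batches of 2, 3 and 1 elements.
say (1..6).rotor(2, 3, 1);  # ((1 2) (3 4 5) (6))
```

With the sample text from this post, the computed batch sizes are 2, 3 and 1, which is why the last line mirrors the shape of the output shown above.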
Iteration
Another approach, copied from reddit, uses an iterator. I did not fully understand it, but it looks excellent too!
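# consume lines from the shared iterator until `=end code` (or the end of input)
sub doSomething(Iterator $iter) {
    my @lines = [];
    my $item := $iter.pull-one;
    until ($item =:= IterationEnd || $item.Str ~~ / '=end code' /) {
        @lines.push($item);
        $item := $iter.pull-one;
    }
    say "Got @lines[]";
}

my Iterator $iter = $excerpt.lines.iterator;
my $item := $iter.pull-one;
until ($item =:= IterationEnd) {
    # a `=begin code` line hands the iterator over to doSomething
    if ($item.Str ~~ / '=begin code' /) {
        doSomething($iter);
    }
    # continue with the line after the one doSomething stopped at
    $item := $iter.pull-one;
}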
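The part that is easy to miss is that the iterator is shared: every pull-one call consumes one item, no matter which routine makes the call, and IterationEnd is the sentinel returned once nothing is left. The following is a small sketch of that behaviour, assuming a hypothetical helper named pull-some and made-up data; it is not taken from the reddit code.

```raku
# Hypothetical helper: pulls up to $n items from a shared iterator.
sub pull-some(Iterator $it, Int $n) {
    my @got;
    for ^$n {
        my $item := $it.pull-one;
        last if $item =:= IterationEnd;  # sentinel: the iterator is exhausted
        @got.push($item);
    }
    @got
}

my $iter = <a b c d e>.iterator;
my @first = pull-some($iter, 2);   # consumes a and b
my @rest  = pull-some($iter, 10);  # resumes at c and stops at IterationEnd

say @first;  # [a b]
say @rest;   # [c d e]
```

This is why the outer loop above never sees the lines inside a section: doSomething has already pulled them from the same iterator.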
comb
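When matching against a multi-line string, use ^^ and $$ to anchor the start and end of a line. Whatever comes before <( takes part in the match but is not captured in the Match object, and whatever comes after )> also takes part in the match without being captured. This ensures that the regex passed to comb picks out only the lines we are interested in: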
for $excerpt.comb(/^^ '=begin code' $$ \s* <( .+? )> \s+ ^^ '=end code' $$/) -> $c {
    say $c;
    say '-' x 15;
}
Output:
This code block is what we're after.
We'll use 'ff' to get it.
---------------
I want this line.
and this line as well.
HaHa
---------------
Let's to go home.
---------------
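The same combination of line anchors and capture markers can be tried on something smaller. The sketch below uses an invented three-line string and an invented TODO: prefix. ^^ rejects the prefix when it appears in the middle of a line, and <( ... )> keeps the prefix out of the returned strings.

```raku
# Sample text, invented for this illustration.
my $text = "TODO: write tests\nnot a TODO: line\nTODO: ship it\n";

# The prefix must start a line; only the text between <( and )> is returned.
my @todos = $text.comb(/ ^^ 'TODO: ' <( \N+ )> $$ /);

say @todos.elems;  # 2
.say for @todos;   # write tests, then ship it
```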
Reference: https://stackoverflow.com/questions/49280568/the-use-of-flip-flop-operator-in-perl-6/