5 Ways to Extract Text Blocks

Sat Jan 19, 2019

Suppose a piece of text is divided into sections by =begin code and =end code lines, and I want to extract the content of each section. Grammar to the rescue!

my $excerpt = q:to/END/;
Here's some unimportant text.
=begin code
This code block is what we're after.
We'll use 'ff' to get it.
=end code
More unimportant text.
=begin code
I want this line.
and this line as well.
HaHa
=end code
More unimport text.
=begin code
Let's to go home.
=end code
END

Grammar

#use Grammar::Tracer;
#use Grammar::Debugger;

grammar ExtractSection {
    rule TOP      { ^ <section>+ %% <.comment> $      }
    token section { <line>+ % <.ws>                   }
    token line    { <!before <comment>> \N+ \n        }
    token comment { ['=begin code' | '=end code' ] \n }
}

class ExtractSectionAction {
    method TOP($/)      { make $/.values».ast }
    method section($/)  { make ~$/.trim       }
    method line($/)     { make ~$/.trim       }
    method comment($/)  { make Empty          }
}

my $em = ExtractSection.parse($excerpt, :actions(ExtractSectionAction)).ast;

for @$em -> $line {
    say $line;
    say '-' x 35;
}

Output:

Here's some unimportant text.
-----------------------------------
This code block is what we're after.
We'll use 'ff' to get it.
-----------------------------------
More unimportant text.
-----------------------------------
I want this line.
and this line as well.
HaHa
-----------------------------------
More unimport text.
-----------------------------------
Let's to go home.
-----------------------------------

However, this also pulls in the unrelated lines outside the code blocks. Brad Gilbert suggests writing it like this instead:

#use Grammar::Tracer;
#use Grammar::Debugger;

grammar ExtractSection {
  token start   { ^^ '=begin code' \n          }
  token finish  { ^^ '=end code' \n            }
  token line    { ^^ <( \N+ )> \n              }
  token section { <start> ~ <finish> <line>+?  }
  token comment { ^^ \N+ \n                    }
  token TOP     { [<section> || <comment>]+    } 
}

class ExtractSectionAction {
    method TOP($/)     { make @<section>».ast.List }
    method section($/) { make ~«@<line>.List       }
    method line($/)    { make ~$/.trim             }
    method comment($/) { make Empty                }
}

my $em = ExtractSection.parse($excerpt, :actions(ExtractSectionAction)).ast;

for @$em -> $line {
    say $line.perl;
    say '-' x 35;
}

Output:

$("This code block is what we're after.", "We'll use 'ff' to get it.")
-----------------------------------
$("I want this line.", "and this line as well.", "HaHa")
-----------------------------------
$("Let's to go home.",)
-----------------------------------

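Now we can iterate over each section and do whatever we need with it. The nice touch here is the ~ (tilde) in the section token: <start> ~ <finish> <line>+? matches <start>, then <line>+?, then <finish>, so the two delimiters sit next to each other in the source. Excellent!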
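To make the tilde easier to see in isolation, here is a minimal sketch on a toy string rather than on the excerpt. The variable names and the parenthesized input are made up for illustration.

```raku
# Regex tilde: in `A ~ B C`, the regex matches A, then C, then B.
# Here A is '(', B is ')' and C is the capture group (\w+).
my $text = "(hello)";
my $m = $text ~~ / '(' ~ ')' (\w+) /;

say ~$m;      # (hello)
say ~$m[0];   # hello
```

Writing the opener and closer side by side is the same idea as `<start> ~ <finish> <line>+?` in the grammar above.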

rotor

Since the text is structured, it stays structured once it is stored in an array, so rotor can do the job too:

my @sections =
gather for $excerpt.lines -> $line {
    if $line ~~ /'=begin code'/ ff $line ~~ /'=end code'/ {
      take $line.trim;
    }
}


my @idx = # gather take the indices of every `=begin code` and `=end code`
gather for @sections.kv -> $k, $v {
    if $v ~~ /'=begin code'/ or $v ~~ /'=end code'/ {
        take $k;
    }
}

my @r = # gather take the lines except every line of `=begin code` and `=end code`
gather for @sections.kv -> $k, $v {
    if $v !~~ /'=begin code' | '=end code'/  {
        take $v;
    }
}

my @counts = @idx.rotor(2)».minmax».elems »-» 2;
say @r.rotor(|@counts).perl;

Output:

(("This code block is what we're after.", "We'll use 'ff' to get it."), ("I want this line.", "and this line as well.", "HaHa"), ("Let's to go home.",)).Seq

Also excellent!
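The counting step is the least obvious part, so here is a small sketch of it on its own. The index list is written out by hand to match the marker positions in the sample text, the items are placeholders, and the explicit `map` is an assumption of mine that swaps in plain subtraction for the `minmax`/`elems` chain used above.

```raku
# Positions of the =begin/=end marker lines, written out by hand
# to match the sample text (three begin/end pairs).
my @idx = 0, 3, 4, 8, 9, 11;

# Number of lines strictly between each pair: end - begin - 1.
my @counts = @idx.rotor(2).map({ $_[1] - $_[0] - 1 });
say @counts;                # [2 3 1]

# rotor with several sizes cuts the list into chunks of those sizes in turn.
my @items  = <a b c d e f>;
my @groups = @items.rotor(|@counts);
say @groups.elems;          # 3
say @groups[1].join(',');   # c,d,e
```

Note that rotor cycles through the sizes, so a longer list would start again with a chunk of 2 once the sizes 2, 3, 1 are used up.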

Iteration

Another approach, copied from reddit, uses an iterator. I did not fully understand it, but it looks excellent too!

sub doSomething(Iterator $iter) { 
    my @lines = [];
    my $item := $iter.pull-one;
    until ($item =:= IterationEnd || $item.Str ~~ / '=end code' /) {
       @lines.push($item);
       $item := $iter.pull-one;
    }
    say "Got @lines[]";
}
my Iterator $iter = $excerpt.lines.iterator;
my $item := $iter.pull-one;
until ($item =:= IterationEnd) {
    if ($item.Str ~~ / '=begin code' /) {
       doSomething($iter);
    }
    $item := $iter.pull-one;
}
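The code above relies on two things: `pull-one` hands out the next item each time it is called, and it returns the sentinel `IterationEnd` once nothing is left. Because the same iterator object is passed into `doSomething`, the lines consumed there are not seen again by the outer loop. A minimal sketch of the pull loop alone, using a made-up three-item list:

```raku
# Drain an iterator by hand; the list <a b c> is only for illustration.
my $iter = <a b c>.iterator;
my @seen;

loop {
    my $item := $iter.pull-one;
    # IterationEnd is a sentinel, so compare by identity with =:=
    last if $item =:= IterationEnd;
    @seen.push($item);
}

say @seen;   # [a b c]
```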

comb

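When matching against a multi-line string, use ^^ and $$ to anchor the start and end of a line. Whatever comes before <( takes part in the match but is not captured in the Match object, and whatever comes after )> takes part in the match but is not captured either. This makes sure the regex passed to comb returns only the lines we are interested in: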

for $excerpt.comb(/^^ '=begin code' $$ \s* <( .+? )> \s+ ^^ '=end code' $$/) -> $c {
    say $c;
    say '-' x 15;
}

Output:

This code block is what we're after.
We'll use 'ff' to get it.
---------------
I want this line.
and this line as well.
HaHa
---------------
Let's to go home.
---------------
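For a smaller view of what the <( and )> markers do, here is a sketch on two toy strings. The key/value input is invented for the example.

```raku
# <( and )> narrow the reported match; the text around them must still match.
my $line = "key: value";
my $m = $line ~~ / 'key:' \s* <( \w+ )> /;
say ~$m;       # value
say $m.from;   # 5

# comb returns the narrowed match for every hit, line by line.
my $text   = "a=1\nb=2\n";
my @values = $text.comb(/ ^^ \w+ '=' <( \d+ )> $$ /);
say @values;   # [1 2]
```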

Reference: https://stackoverflow.com/questions/49280568/the-use-of-flip-flop-operator-in-perl-6/