Project Icon

parboiled2

Scala 2.12+的高效PEG解析器生成库

parboiled2是一个用于Scala 2.12+的PEG解析器生成库。它在编译时将语法规则转换为JVM字节码,提供轻量高效的文本解析能力。该库具有简洁的DSL、准确的错误报告和高性能,可替代正则表达式,适用于多种解析场景。parboiled2无外部依赖,上手简单,是一个功能强大yet易用的解析工具。

parboiled2 |--| A Macro-Based PEG Parser Generator for Scala 2.12+

.. contents:: Contents of this Document

Introduction

parboiled2 is a Scala 2.12+ library enabling lightweight and easy-to-use, yet powerful, fast and elegant parsing of arbitrary input text. It implements a macro-based parser generator for Parsing Expression Grammars_ (PEGs), which runs at compile time and translates a grammar rule definition (written in an internal Scala DSL) into corresponding JVM bytecode.

PEGs are an alternative to Context-Free Grammars_ (CFGs) for formally specifying syntax, they make a good replacement for regular expressions and have some advantages over the "traditional" way of building parsers via CFGs (like not needing a separate lexer/scanner phase).

parboiled2 is the successor of parboiled 1.x_ , which provides a similar capability (for Scala as well as Java) but does not actually generate a parser. Rather parboiled 1.x_ interprets a rule tree structure (which is also created via an internal DSL) against the input, which results in a much lower parsing performance. For more info on how parboiled 1.x_ and parboiled2 compare see parboiled2 vs. parboiled 1.x. You might also be interested in reading about parboiled2 vs. Scala Parser Combinators and parboiled2 vs. Regular Expressions_.

.. _PEG: .. _Parsing Expression Grammars: https://en.wikipedia.org/wiki/Parsing_expression_grammar .. _Context-Free Grammars: https://en.wikipedia.org/wiki/Context-free_grammar .. _parboiled 1.x: http://parboiled.org

Features

  • Concise, flexible and type-safe DSL for expressing parsing logic
  • Full expressive power of Parsing Expression Grammars_, for effectively dealing with most real-world parsing needs
  • Excellent reporting of parse errors
  • Parsing performance comparable to hand-written parsers
  • Easy to learn and use (just one parsing phase (no lexer code required), rather small API)
  • Light-weight enough to serve as a replacement for regular expressions (also strictly more powerful than regexes)

Installation

The artifacts for parboiled2 live on Maven Central_ and can be tied into your SBT-based Scala project like this:

.. code:: Scala

libraryDependencies += "org.parboiled" %% "parboiled" % "2.5.0"

The latest released version is 2.5.1. It is available for Scala 2.12, 2.13 and 3, Scala.js and Scala Native.

parboiled2 has no external dependencies. (It used to depend on shapeless_, but the few bits it was using have been internalized at some point).

Once on your classpath you can use this single import to bring everything you need into scope:

.. code:: Scala

import org.parboiled2._

.. _Maven Central: https://search.maven.org/search?q=g:org.parboiled .. _shapeless: https://github.com/milessabin/shapeless

Example

This is what a simple parboiled2 parser looks like:

.. code:: Scala

import org.parboiled2._

class Calculator(val input: ParserInput) extends Parser {
  def InputLine = rule { Expression ~ EOI }

  def Expression: Rule1[Int] = rule {
    Term ~ zeroOrMore(
      '+' ~ Term ~> ((_: Int) + _)
    | '-' ~ Term ~> ((_: Int) - _))
  }

  def Term = rule {
    Factor ~ zeroOrMore(
      '*' ~ Factor ~> ((_: Int) * _)
    | '/' ~ Factor ~> ((_: Int) / _))
  }

  def Factor = rule { Number | Parens }

  def Parens = rule { '(' ~ Expression ~ ')' }

  def Number = rule { capture(Digits) ~> (_.toInt) }

  def Digits = rule { oneOrMore(CharPredicate.Digit) }
}

new Calculator("1+1").InputLine.run() // evaluates to `scala.util.Success(2)`

This implements a parser for simple integer expressions like 1+(2-3*4)/5 and runs the actual calculation in-phase with the parser. If you'd like to see it run and try it out yourself check out Running the Examples_.

Quick Start

A parboiled2 parser is a class deriving from org.parboiled2.Parser, which defines one abstract member:

.. code:: Scala

def input: ParserInput

holding the input for the parsing run. Usually it is best implemented as a val parameter in the constructor (as shown in the Example_ above). As you can see from this design you need to (re-)create a new parser instance for every parsing run (parser instances are very lightweight).

The "productions" (or "rules") of your grammar are then defined as simple methods, which in most cases consist of a single call to the rule macro whose argument is a DSL expression_ defining what input the rule is to match and what actions_ to perform.

In order to run your parser against a given input you create a new instance and call run() on the top-level rule, e.g:

.. code:: Scala

val parser = new MyParser(input)
parser.topLevelRule.run() // by default returns a ``scala.util.Try``

For more info on what options you have with regard to accessing the results of a parsing run check out the section on Access to Parser Results_.

.. DSL expression: The Rule DSL .. actions: Parser Actions

How the Parser matches Input

PEG_ parsers are quite easy to understand as they work just like most people without a lot of background in parsing theory would build a parser "by hand": recursive-descent with backtracking. They have only one parsing phase (not two, like most parsers produced by traditional parser generators like ANTLR_), do not require any look-ahead and perform quite well in most real-world scenarios (although they can exhibit exponential runtime for certain pathological languages and inputs).

A PEG_ parser consists of a number of rules that logically form a "tree", with one "root" rule at the top calling zero or more lower-level rules, which can each call other rules and so on. Since rules can also call themselves or any of their parents the rule "tree" is not really a tree but rather a potentially cyclic directed graph, but in most cases the tree structure dominates, which is why its useful to think of it as a tree with potential cycles.

When a rule is executed against the current position in an input buffer it applies its specific matching logic to the input, which can either succeed or fail. In the success case the parser advances the input position (the cursor) and potentially executes the next rule. Otherwise, when the rule fails, the cursor is reset and the parser backtracks in search of another parsing alternative that might succeed.

For example consider this simple parboiled2 rule:

.. code:: Scala

def foo = rule { 'a' ~ ('b' ~ 'c' | 'b' ~ 'd') }

When this rule is confronted with the input abd the parser matches the input in these steps:

  1. Rule foo starts executing, which calls its first sub-rule 'a'. The cursor is at position 0.
  2. Rule 'a' is executed against input position 0, matches (succeeds) and the cursor is advanced to position 1.
  3. Rule 'b' ~ 'c' | 'b' ~ 'd' starts executing, which calls its first sub-rule 'b' ~ 'c'.
  4. Rule 'b' ~ 'c' starts executing, which calls its first sub-rule 'b'.
  5. Rule 'b' is executed against input position 1, matches (succeeds) and the cursor is advanced to position 2.
  6. Rule 'c' is executed against input position 2 and mismatches (fails).
  7. Rule 'b' ~ 'c' | 'b' ~ 'd' notices that its first sub-rule has failed, resets the cursor to position 1 and calls its 2nd sub-rule 'b' ~ 'd'.
  8. Rule 'b' ~ 'd' starts executing, which calls its first sub-rule 'b'.
  9. Rule 'b' is executed against input position 1, matches and the cursor is advanced to position 2.
  10. Rule 'd' is executed against input position 2, matches and the cursor is advanced to position 3.
  11. Rule 'b' ~ 'd' completes successfully, as its last sub-rule has succeeded.
  12. Rule 'b' ~ 'c' | 'b' ~ 'd' completes successfully, as one of its sub-rules has succeeded.
  13. Rule foo completes execution successfully, as its last sub-rule has succeeded. The whole input "abd" was matched and the cursor is left at position 3 (after the last-matched character).

.. _ANTLR: https://www.antlr.org/

The Rule DSL

In order to work with parboiled2 effectively you should understand the core concepts behind its rule DSL, mainly the "Value Stack" and how parboiled2 encodes value stack operations in the Scala type system.

Rule Types and the Value Stack

Apart from the input buffer and the cursor the parser manages another important structure: the "Value Stack". The value stack is a simple stack construct that serves as temporary storage for your Parser Actions. In many cases it is used for constructing an AST during the parsing run but it can also be used for "in-phase" computations (like in the Example_ above) or for any other purpose.

When a rule of a parboiled2 parser executes it performs any combination of the following three things:

  • match input, i.e. advance the input cursor
  • operate on the value stack, i.e. pop values off and/or push values to the value stack
  • perform side-effects

Matching input is done by calling Basic Character Matching_ rules, which do nothing but match input and advance the cursor. Value stack operations (and other potential side-effects) are performed by Parser Actions_.

It is important to understand that rules in parboiled2 (i.e. the rule methods in your parser class) do not directly return some custom value as a method result. Instead, all their consuming and producing values happens as side-effects to the value stack. Thereby the way that a rule interacts with value stack is encoded in the rule's type.

This is the general definition of a parboiled2 rule:

.. code:: Scala

class Rule[-I <: HList, +O <: HList]

This can look scary at first but is really quite simple. An HList is defined by shapeless_ and is essentially a type of list whose element number and element types are statically known at compile time. The I type parameter on Rule encodes what values (the number and types) the rule pops off the value stack and the O type parameter encodes what values (the number and types) the rule then pushes onto the value stack.

Luckily, in most cases, you won't have to work with these types directly as they can either be inferred or you can use one of these predefined aliases:

.. code:: Scala

type Rule0 = RuleN[HNil]
type Rule1[+T] = RuleN[T :: HNil]
type Rule2[+A, +B] = RuleN[A :: B :: HNil]
type RuleN[+L <: HList] = Rule[HNil, L]
type PopRule[-L <: HList] = Rule[L, HNil]

Here is what these type aliases denote:

Rule0 A rule that neither pops off nor pushes to the value stack, i.e. has no effect on the value stack whatsoever. All Basic Character Matching_ rules are of this type.

Rule1[+T] Pushes exactly one value of type T onto the value stack. After Rule0 this is the second-most frequently used rule type.

Rule2[+A, +B] Pushes exactly two values of types A and B onto the value stack.

RuleN[+L <: HList] Pushes a number of values onto the value stack, which correspond to the given L <: HList type parameter.

PopRule[-L <: HList] Pops a number of values off the value stack (corresponding to the given L <: HList type parameter) and does not produce any new value itself.

The rule DSL makes sure that the rule types are properly assembled and carried through your rule structure as you combine Basic Character Matching_ with Rule Combinators and Modifiers_ and Parser Actions_, so as long as you don't write any logic that circumvents the value stack your parser will be completely type-safe and the compiler will be able to catch you if you make mistakes by combining rules in an unsound way.

.. _AST: https://en.wikipedia.org/wiki/Abstract_syntax_tree

Basic Character Matching

The following basic character matching rules are the only way to cause the parser to match actual input and "make progress". They are the "atomic" elements of the rule DSL which are then used by the Rule Combinators and Modifiers_ to form higher-level rules.


implicit def ch(c: Char): Rule0 Char values can be directly used in the rule DSL and match themselves. There is one notable case where you will have to use the explicit ch wrapper: You cannot use the | operator directly on chars as it denotes the built-in Scala binary "or" operator defined on numeric types (Char is an unsigned 16-bit integer). So rather than saying 'a' | 'b' you will have to say ch('a') | 'b'.


implicit def str(s: String): Rule0 String values can be directly used in the rule DSL and match themselves.


implicit def predicate(p: CharPredicate): Rule0 You can use org.parboiled2.CharPredicate values directly in the rule DSL. CharPredicate is an efficient implementation of character sets and already comes with a number pre-defined character classes like CharPredicate.Digit or CharPredicate.LowerHexLetter.


implicit def valueMap[T](m: Map[String, T]): R Values of type Map[String, T] can be directly used in the rule DSL and match any of the given map's keys and push the respective value upon a successful match. The resulting rule type depends on T:

=================== =========================================
``T``               ``R``
=================== =========================================
``Unit``            ``Rule0``
``L <: HList``      ``RuleN[L]`` (pushes all values of ``L``)
``T`` (otherwise)   ``Rule1[T]`` (pushes only one value)
=================== =========================================

def anyOf(chars: String): Rule0 This constructs a Rule0 which matches any of the given strings characters.


def noneOf(chars: String): Rule0 This constructs a Rule0 which matches any single character except the ones in the given string and except EOI.


def ignoreCase(c: Char): Rule0 Matches the given single character case insensitively. Note: The given character must be specified in lower-case! This requirement is currently NOT enforced!


def ignoreCase(s: String): Rule0 Matches the given string of characters case insensitively. Note: The given string must be specified in all lower-case! This requirement is currently NOT enforced!


def ANY: Rule0 Matches any character except EOI (end-of-input).


def EOI: Char The EOI (end-of-input) character, which is a virtual character that the parser "appends" after the last character of the actual input.


def MATCH: Rule0 Matches no character (i.e. doesn't cause the parser to make any progress) but succeeds always. It's the "empty" rule that is mostly used as a neutral element in rule composition.


def MISMATCH[I <: HList, O <: HList]: Rule[I, O] A rule that always fails. Fits any rule signature.


def MISMATCH0: Rule0 Same as MISMATCH but with a clearly defined type. Use it (rather then MISMATCH) if the call site doesn't clearly "dictate" a certain rule type and using MISMATCH therefore gives you a compiler error.

Rule Combinators and Modifiers

Rules can be freely combined/modified with these operations:


a ~ b Two rules a and b can be combined with the ~ operator resulting in a rule that only matches if first a matches and then b matches. The computation of the resulting rule type is somewhat involved. Here is an illustration (using an abbreviated HList notation):

====================== ==================== =========================
a                      b                    a ~ b
====================== ==================== =========================
``Rule[, A]``          ``Rule[, B]``        ``Rule[, A:B]``
``Rule[A:B:C, D:E:F]`` ``Rule[F, G:H]``     ``Rule[A:B:C, D:E:G:H]``
``Rule[A, B:C]``       ``Rule[D:B:C, E:F]`` ``Rule[D:A, E:F]``
``Rule[A, B:C]``       ``Rule[D:C, E:F]``   Illegal if ``D`` != ``B``
====================== ==================== =========================

a | b Two rules a and b can be combined with the | operator to form an "ordered choice" in PEG_ speak. The resulting rule tries to match

项目侧边栏1项目侧边栏2
推荐项目
Project Cover

豆包MarsCode

豆包 MarsCode 是一款革命性的编程助手,通过AI技术提供代码补全、单测生成、代码解释和智能问答等功能,支持100+编程语言,与主流编辑器无缝集成,显著提升开发效率和代码质量。

Project Cover

AI写歌

Suno AI是一个革命性的AI音乐创作平台,能在短短30秒内帮助用户创作出一首完整的歌曲。无论是寻找创作灵感还是需要快速制作音乐,Suno AI都是音乐爱好者和专业人士的理想选择。

Project Cover

白日梦AI

白日梦AI提供专注于AI视频生成的多样化功能,包括文生视频、动态画面和形象生成等,帮助用户快速上手,创造专业级内容。

Project Cover

有言AI

有言平台提供一站式AIGC视频创作解决方案,通过智能技术简化视频制作流程。无论是企业宣传还是个人分享,有言都能帮助用户快速、轻松地制作出专业级别的视频内容。

Project Cover

Kimi

Kimi AI助手提供多语言对话支持,能够阅读和理解用户上传的文件内容,解析网页信息,并结合搜索结果为用户提供详尽的答案。无论是日常咨询还是专业问题,Kimi都能以友好、专业的方式提供帮助。

Project Cover

讯飞绘镜

讯飞绘镜是一个支持从创意到完整视频创作的智能平台,用户可以快速生成视频素材并创作独特的音乐视频和故事。平台提供多样化的主题和精选作品,帮助用户探索创意灵感。

Project Cover

讯飞文书

讯飞文书依托讯飞星火大模型,为文书写作者提供从素材筹备到稿件撰写及审稿的全程支持。通过录音智记和以稿写稿等功能,满足事务性工作的高频需求,帮助撰稿人节省精力,提高效率,优化工作与生活。

Project Cover

阿里绘蛙

绘蛙是阿里巴巴集团推出的革命性AI电商营销平台。利用尖端人工智能技术,为商家提供一键生成商品图和营销文案的服务,显著提升内容创作效率和营销效果。适用于淘宝、天猫等电商平台,让商品第一时间被种草。

Project Cover

AIWritePaper论文写作

AIWritePaper论文写作是一站式AI论文写作辅助工具,简化了选题、文献检索至论文撰写的整个过程。通过简单设定,平台可快速生成高质量论文大纲和全文,配合图表、参考文献等一应俱全,同时提供开题报告和答辩PPT等增值服务,保障数据安全,有效提升写作效率和论文质量。

投诉举报邮箱: service@vectorlightyear.com
@2024 懂AI·鲁ICP备2024100362号-6·鲁公网安备37021002001498号