基于 LLVM 自制编译器(6)——语言扩展：自定义运算符

发表于 2022-08-28 更新于 2023-09-16 分类于 LLVM Waline：

概述

目前为止，Kaleidoscope 已经是一门功能齐全且有用的编程语言了。但是，它仍然存在一个很大的问题。当前的 Kaleidoscope 缺少很多有用的运算符，比如：取反、比较等。

本章，我们将为 Kaleidoscope 支持自定义运算符，从而让 Kaleidoscope 具备更强大的编程能力。

目标

我们希望为 Kaleidoscope 支持的 运算符重载（Operator Overloading）能够比 C++ 等语言更加通用。在 C++ 中，我们只能够重新定义已存在的运算符，我们不能以编程方式修改其语法，也不能引入新的运算符、修改运算符优先级等。在本章中，我们将为 Kaleidoscope 支持这种更加强大的能力。

目前为止，我们通过手写的方式实现了一个递归下降的解析器，能够解析目前我们所定义的表达式，包括语法、运算符优先级等。通过运算符优先级解析，我们可以非常简单地引入新的运算符。

本文我们将扩展两种运算符：

一元运算符（Unary Operator）
二元运算符（Binary Operator）

如下所示，为 Kaleidoscope 中自定义一元运算符和二元运算符的示例。

# Logical unary not.
def unary!(v)
  if v then
    0
  else
    1;

# Define > with the same precedence as <.
def binary> 10 (LHS RHS)
  RHS < LHS;

# Binary "logical or", (note that it does not "short circuit")
def binary| 5 (LHS RHS)
  if LHS then
    1
  else if RHS then
    1
  else
    0;

# Define = with slightly lower precedence than relationals.
def binary= 9 (LHS RHS)
  !(LHS < RHS | LHS > RHS);

很多语言希望能够在语言本身中实现其标准运行时库。在本章中，我们会在 Kaleidoscope 中将自定义运算符在其标准库中实现。

词法分析器扩展

无论是一元运算符还是二元运算符，对于词法分析器而言，两者只不过是关键词不同，分别是：unary 和 binary。因此，我们为 unary 和 binary 分别定义对应的 token 类型和 token 值，如下所示。

enum Token {
  ...
  // operators
  tok_binary = -11,
  tok_unary = -12
};
...
static int gettok() {
...
    if (IdentifierStr == "for")
      return tok_for;
    if (IdentifierStr == "in")
      return tok_in;
    if (IdentifierStr == "binary")
      return tok_binary;
    if (IdentifierStr == "unary")
      return tok_unary;
    return tok_identifier;

运算符的定义解析

从上述的目标可以看出，一元运算符和二元运算符均使用函数的方式进行定义。与普通函数定义不同的是，一元运算符和二元运算符的函数原型略有不同。因此，我们需要对 PrototypeAST 进行修改，从而支持解析两种类型的运算符。

/// PrototypeAST - This class represents the "prototype" for a function,
/// which captures its argument names as well as if it is an operator.
class PrototypeAST {
  std::string Name;
  std::vector<std::string> Args;
  bool IsOperator;
  unsigned Precedence;  // Precedence if a binary op.

public:
  PrototypeAST(const std::string &name, std::vector<std::string> Args,
               bool IsOperator = false, unsigned Prec = 0)
  : Name(name), Args(std::move(Args)), IsOperator(IsOperator),
    Precedence(Prec) {}

  Function *codegen();
  const std::string &getName() const { return Name; }

  bool isUnaryOp() const { return IsOperator && Args.size() == 1; }
  bool isBinaryOp() const { return IsOperator && Args.size() == 2; }

  char getOperatorName() const {
    assert(isUnaryOp() || isBinaryOp());
    return Name[Name.size() - 1];
  }

  unsigned getBinaryPrecedence() const { return Precedence; }
};

我们在原始的 PrototypeAST 定义的基础上，新增两个属性用于表示普通的函数原型和自定义运算符。

IsOperator 表示是否是运算符。
Precedence 表示运算符的优先级。优先级仅用于二元运算符。

通过修改 PrototypeAST，使其支持识别一元运算符和二元运算符的定义后，我们来进一步实现具体的解析逻辑，如下所示。

/// prototype
///   ::= id '(' id* ')'
///   ::= binary LETTER number? (id, id)
static std::unique_ptr<PrototypeAST> ParsePrototype() {
  std::string FnName;

  unsigned Kind = 0;  // 0 = identifier, 1 = unary, 2 = binary.
  unsigned BinaryPrecedence = 30;

  switch (CurTok) {
  default:
    return LogErrorP("Expected function name in prototype");
  case tok_identifier:
    FnName = IdentifierStr;
    Kind = 0;
    getNextToken();
    break;
 case tok_unary:
    getNextToken();
    if (!isascii(CurTok))
      return LogErrorP("Expected unary operator");
    FnName = "unary";
    FnName += (char)CurTok;
    Kind = 1;
    getNextToken();
    break;
  case tok_binary:
    getNextToken();
    if (!isascii(CurTok))
      return LogErrorP("Expected binary operator");
    FnName = "binary";
    FnName += (char)CurTok;
    Kind = 2;
    getNextToken();

    // Read the precedence if present.
    if (CurTok == tok_number) {
      if (NumVal < 1 || NumVal > 100)
        return LogErrorP("Invalid precedence: must be 1..100");
      BinaryPrecedence = (unsigned)NumVal;
      getNextToken();
    }
    break;
  }

  if (CurTok != '(')
    return LogErrorP("Expected '(' in prototype");

  std::vector<std::string> ArgNames;
  while (getNextToken() == tok_identifier)
    ArgNames.push_back(IdentifierStr);
  if (CurTok != ')')
    return LogErrorP("Expected ')' in prototype");

  // success.
  getNextToken();  // eat ')'.

  // Verify right number of names for operator.
  if (Kind && ArgNames.size() != Kind)
    return LogErrorP("Invalid number of operands for operator");

  return std::make_unique<PrototypeAST>(FnName, std::move(ArgNames), Kind != 0,
                                         BinaryPrecedence);
}

在 ParsePrototype 的解析逻辑中，我们通过解析运算符得到 FnName 变量，将其作为自定义运算符的构建名称，比如：为 @ 操作符构建 binary@ 的名称。由于 LLVM 符号表中的符号允许包含任何字符，所以我们可以将构建名称（如：binary@）存入符号表。

这里，一元运算符和二元运算符的定义的解析逻辑非常相似，唯一的区别在于，二元运算符需要额外解析运算符优先级。

运算符的表达式解析

上一节，我们介绍了如何解析运算符定义。本节，我们来介绍如何解析运算符表达式。

在第 2 章中，我们已经介绍了 Kaleidoscope 中定义的三种类型的 AST 结构，分别是：表达式、原型、函数。其中，我们已经定义了一种表达式子类型 BinaryExprAST，用于表示二元运算符表达式。因此，我们只需要额外定义一种新的表达式子类型 UnaryExprAST 表示一元运算符表达式即可。

UnaryExprAST 的具体定义如下所示。其中，Opcode 表示运算符符号，Operand 表示运算符所作用的操作数。

/// UnaryExprAST - Expression class for a unary operator.
class UnaryExprAST : public ExprAST {
  char Opcode;
  std::unique_ptr<ExprAST> Operand;

public:
  UnaryExprAST(char Opcode, std::unique_ptr<ExprAST> Operand)
    : Opcode(Opcode), Operand(std::move(Operand)) {}

  Value *codegen() override;
};

接下来，我们来分别看有一下二元运算符和一元运算符的代码生成实现逻辑。

二元表达式代码生成

如下所示，为二元表达式实现代码生成的逻辑。我们仅仅是在原有的代码生成逻辑中进行了扩展。

Value *BinaryExprAST::codegen() {
  Value *L = LHS->codegen();
  Value *R = RHS->codegen();
  if (!L || !R)
    return nullptr;

  switch (Op) {
  case '+':
    return Builder.CreateFAdd(L, R, "addtmp");
  case '-':
    return Builder.CreateFSub(L, R, "subtmp");
  case '*':
    return Builder.CreateFMul(L, R, "multmp");
  case '<':
    L = Builder.CreateFCmpULT(L, R, "cmptmp");
    // Convert bool 0/1 to double 0.0 or 1.0
    return Builder.CreateUIToFP(L, Type::getDoubleTy(TheContext),
                                "booltmp");
  default:
    break;
  }

  // If it wasn't a builtin binary operator, it must be a user defined one. Emit
  // a call to it.
  Function *F = getFunction(std::string("binary") + Op);
  assert(F && "binary operator not found!");

  Value *Ops[2] = { L, R };
  return Builder.CreateCall(F, Ops, "binop");
}

在解析二元表达式时，我们需要处理两种情况：

处理 默认运算符，比如：+、-、*、< 等。我们只需要调用 Builder 生成对应的 LLVM IR 即可。
处理 自定义运算符，比如：@、| 等。我们只需要在符号表中查找对应的运算符，并生成对函数（如：binary@）的调用，从而生成 LLVM IR。由于自定义运算符只是作为普通函数构建的，所以不存在额外的特殊逻辑。

一元表达式代码生成

如下所示，为一元表达式实现代码生成的逻辑。

Value *UnaryExprAST::codegen() {
  Value *OperandV = Operand->codegen();
  if (!OperandV)
    return nullptr;

  Function *F = getFunction(std::string("unary") + Opcode);
  if (!F)
    return LogErrorV("Unknown unary operator");

  return Builder.CreateCall(F, OperandV, "unop");
}

其生成逻辑非常简单，通过在符号表中查找对应的运算符，并生成对函数（如：binary!）的调用，从而生成 LLVM IR。

通用表达式解析优化

由于现在我们新增了一种表达式类型 UnaryExprAST，因此，我们需要对原始的通用表达式解析函数进行优化，以支持解析一元表达式。

/// expression
///   ::= unary binoprhs
///
static std::unique_ptr<ExprAST> ParseExpression() {
  auto LHS = ParseUnary();
  if (!LHS)
    return nullptr;

  return ParseBinOpRHS(0, std::move(LHS));
}

ParseExpression 的实现逻辑非常简单，使用 ParseUnary 解析左操作数。然后继续解析结果，再解析右操作数。

由于表达式的右部可能也包含一元表达还是，因此，我们还需要修改 ParseBinOpRHS，支持解析一元运算符，如下所示。

/// binoprhs
///   ::= ('+' unary)*
static std::unique_ptr<ExprAST> ParseBinOpRHS(int ExprPrec,
                                              std::unique_ptr<ExprAST> LHS) {
  ...
    // Parse the unary expression after the binary operator.
    auto RHS = ParseUnary();
    if (!RHS)
      return nullptr;
  ...
}

上述的修改最终都会调用 ParseUnary 解析函数，其具体实现如下所示。

/// unary
///   ::= primary
///   ::= '!' unary
static std::unique_ptr<ExprAST> ParseUnary() {
  // If the current token is not an operator, it must be a primary expr.
  if (!isascii(CurTok) || CurTok == '(' || CurTok == ',')
    return ParsePrimary();

  // If this is a unary operator, read it.
  int Opc = CurTok;
  getNextToken();
  if (auto Operand = ParseUnary())
    return std::make_unique<UnaryExprAST>(Opc, std::move(Operand));
  return nullptr;
}

ParseUnary 解析函数的实现非常简单。当我们在解析主表达式时遇到一元运算符时，我们会将运算符符作为前缀，并将剩余部分作为另一个一元运算符进行解析。这样使得我们能够处理多个一元运算符串联的场景，比如： !!x。注意，一元运算符不能像二元运算符那样具有二义性的解析，因此不需要设置优先级。

运算符函数调用

无论是一元运算符还是二元运算符，最终的实现逻辑均由函数体实现。因此，我们需要通过扩展 FunctionAST 的解析函数 codegen()，从而生成对应的 LLVM IR，具体如下所示。

Function *FunctionAST::codegen() {
  // Transfer ownership of the prototype to the FunctionProtos map, but keep a
  // reference to it for use below.
  auto &P = *Proto;
  FunctionProtos[Proto->getName()] = std::move(Proto);
  Function *TheFunction = getFunction(P.getName());
  if (!TheFunction)
    return nullptr;

  // If this is an operator, install it.
  if (P.isBinaryOp())
    BinopPrecedence[P.getOperatorName()] = P.getBinaryPrecedence();

  // Create a new basic block to start insertion into.
  BasicBlock *BB = BasicBlock::Create(TheContext, "entry", TheFunction);
  
  ...
  
  if (P.isBinaryOp())
    BinopPrecedence.erase(P.getOperatorName());
  return nullptr;

其主要是在为 FunctionAST 生成代码时，判断是否是自定义运算符的函数定义，如果是，则设置运算符优先级，并在代码生成后移除对应的运算符优先级。通过这种方式，我们可以同时实现一元运算符和二元运算符的底层代码生成逻辑。

测试

通过以上简单的扩展，我们开发出了一门真正的语言。基于 Kaleidoscope，我们可以做很多有趣的事情，包括：I/O、数学运算等。比如，我们添加一个串联操作符，如下所示。注意：printd 用于打印指定值或换行符。

ready> extern printd(x);
Read extern:
declare double @printd(double)

ready> def binary : 1 (x y) 0;
Read function definition:
define double @"binary:"(double %x, double %y) {
entry:
  ret double 0.000000e+00
}
ready> printd(123) : printd(456) : printd(789);
123.000000
456.000000
789.000000
Evaluated to 0.000000

我们还可以定义一系列的原始操作符，如下所示。

# Logical unary not.
def unary!(v)
  if v then
    0
  else
    1;

# Unary negate.
def unary-(v)
  0-v;

# Define > with the same precedence as <.
def binary> 10 (LHS RHS)
  RHS < LHS;

# Binary logical or, which does not short circuit.
def binary| 5 (LHS RHS)
  if LHS then
    1
  else if RHS then
    1
  else
    0;

# Binary logical and, which does not short circuit.
def binary& 6 (LHS RHS)
  if !LHS then
    0
  else
    !!RHS;

# Define = with slightly lower precedence than relationals.
def binary = 9 (LHS RHS)
  !(LHS < RHS | LHS > RHS);

# Define ':' for sequencing: as a low-precedence operator that ignores operands
# and just returns the RHS.
def binary : 1 (x y) y;

总结

至此，Kaleidoscope 开始成为一种真实而强大的语言。

通过本章，我们增强了 Kaleidoscope 语言，在库中添加了扩展语言的能力。此时，Kaleidoscope 可以构建各种功能性的应用程序，并且可以调用具有副作用的函数，但它实际上不能定义和改变变量本身。

然而，变量更新是大多数编程语言的一个重要特性。下一章，我们将描述如何在不在前端构建 SSA 的情况下支持变量更新。

参考

Kaleidoscope: Extending the Language: User-defined Operators