Unconventional: Calling Conventions

Modified June 3rd, 2026 by emd22

Introduction

When programming, how can we call a function and ensure that we get back to the same place?

We could save an address to return to, but if the called function calls another function, that saved address gets overwritten, leading to a big mess.

On top of this, we need to be able to transfer values in and out of these functions. How can we do this while being certain that the return address will not be overwritten?

In programming, the method used for calling functions and passing parameters is called the calling convention. Each architecture, operating system, and sometimes even programming language handles calling functions differently, and each approach has its pros and cons.

To outline some of the problems that can occur when calling functions, here is an example using a pseudo-bytecode:

// This function does not call another function, so
// we do not need to store the previous return address.
// This type of function is called a "leaf function".

FUNCTION Callee ( X, Y )
  ADD X, Y

  // Should the return address be popped into an internal
  // register before the result?

  PUSH RESULT

  // If it is done implicitly with the RETURN instruction,
  // the result will be consumed instead
  RETURN
END FUNCTION

FUNCTION Caller
  // How do we pass the return address to the callee?
  // Should it be pushed before pushing parameters?

  // Push parameters
  PUSH 10
  PUSH 20

  // Or pushed implicitly from the CALL instruction?
  CALL Callee
END FUNCTION
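To make the question in the comments concrete, here is a minimal Python sketch of a toy stack machine where CALL implicitly pushes the return address onto the same stack as the parameters. The opcodes (ADD_BELOW, POPRET, RETREG and so on) are invented for this illustration and are not part of any real bytecode. The first callee pushes its result and then uses an implicit RET, which consumes the result as if it were the return address; the second moves the return address into an internal register first.

```python
def run(program):
    """Toy stack machine with one shared stack (all opcodes are hypothetical)."""
    stack, pc, ret_reg = [], 0, None
    while True:
        if not 0 <= pc < len(program):
            return ("crash", pc)          # jumped outside the program
        op, arg = program[pc]
        pc += 1
        if op == "PUSH":
            stack.append(arg)
        elif op == "CALL":
            stack.append(pc)              # implicit push of the return address
            pc = arg
        elif op == "ADD_BELOW":
            # Add the two parameters sitting just below the return address
            # and push the result on top of it.
            stack.append(stack[-3] + stack[-2])
        elif op == "POPRET":
            ret_reg = stack.pop()         # move the return address to a register
        elif op == "ADD":
            y, x = stack.pop(), stack.pop()
            stack.append(x + y)
        elif op == "RET":
            pc = stack.pop()              # implicit pop: trusts the top of the stack
        elif op == "RETREG":
            pc = ret_reg                  # return via the internal register
        elif op == "HALT":
            return ("ok", stack)


# The caller occupies addresses 0-3; the callee starts at address 4.
CALLER = [("PUSH", 10), ("PUSH", 20), ("CALL", 4), ("HALT", None)]
naive = CALLER + [("ADD_BELOW", None), ("RET", None)]
fixed = CALLER + [("POPRET", None), ("ADD", None), ("RETREG", None)]

naive_outcome = run(naive)   # RET pops the result (30) and jumps to it
fixed_outcome = run(fixed)   # returns to address 3 with [30] on the stack
```

In the naive version the machine "returns" to address 30, which does not exist. The fixed version works, but only because the callee juggles the return address by hand.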

Native code calling conventions

x86 and x64: "We have return addresses at home"

When calling a function on x64, parameters are first placed in registers and spill over onto the stack once the registers run out.

  • For Linux, integer and pointer parameters go in the rdi, rsi, rdx, rcx, r8, and r9 registers.
  • For Windows, they go in the rcx, rdx, r8, and r9 registers.
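As a rough sketch of the "registers first, then the stack" rule, the Python below assigns a location to each argument. It only models integer and pointer arguments; floating-point and struct arguments follow extra rules that are ignored here. The function name `assign_args`, the ABI keys, and the `stack[n]` labels are made up for this example, and the labels are slot counts rather than real byte offsets.

```python
# Integer/pointer argument registers only (simplifying assumption).
ARG_REGISTERS = {
    "linux": ["rdi", "rsi", "rdx", "rcx", "r8", "r9"],
    "windows": ["rcx", "rdx", "r8", "r9"],
}


def assign_args(count, abi):
    """Return where each of `count` integer arguments would be passed."""
    registers = ARG_REGISTERS[abi]
    locations = []
    for index in range(count):
        if index < len(registers):
            locations.append(registers[index])
        else:
            # Out of registers: the remaining arguments spill to the stack.
            locations.append(f"stack[{index - len(registers)}]")
    return locations


linux_seven = assign_args(7, "linux")
windows_five = assign_args(5, "windows")
```

With seven arguments on Linux only the last one spills, while on Windows the fifth argument already lands on the stack.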

When calling a function, the call instruction pushes the return address onto the stack, and ret pops the top value off the stack into the instruction pointer (rip).

Because of this design, loads relative to the stack pointer (rsp) or the base pointer (rbp) need to be offset to account for the return address stored on the stack.

For example:

; int test(int a, int b)
test:
  ; Setup stack frame
  push  rbp
  mov   rbp, rsp

  ; Spill the arguments from their registers into local stack slots below
  ; rbp. This leaf function never moves rsp, so on Linux these slots sit in
  ; the red zone, the 128 bytes below rsp reserved by the System V ABI.
  mov   dword ptr [rbp - 4], edi
  mov   dword ptr [rbp - 8], esi

  ; Load `a` into eax (the 32-bit A register) and add the preserved `b`.
  ; Note the negative offsets: locals live below rbp!
  mov   eax, dword ptr [rbp - 4]
  add   eax, dword ptr [rbp - 8]

  ; Cleanup stack frame and return the addition result.
  pop   rbp
  ret

An easy mistake to make when writing x64 assembly is using the wrong offset when accessing memory. If the value at [rbp + 8] were modified instead of [rbp - 4], the return address would be clobbered, and ret would likely jump to an invalid address and trigger a fault. Writing to [rbp] itself would clobber the caller's saved base pointer instead.
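To see what sits at each offset, here is a small Python simulation of the stack after `call test` followed by `push rbp` and `mov rbp, rsp`. The addresses (0x1000, 0x401234, 0x2000) are arbitrary values picked for the illustration, and memory is modelled as a plain dictionary keyed by address.

```python
def build_frame(return_address, caller_rbp, rsp=0x1000):
    """Simulate `call test`, then `push rbp` and `mov rbp, rsp`."""
    memory = {}

    # call: push the 8-byte return address
    rsp -= 8
    memory[rsp] = return_address

    # push rbp: save the caller's base pointer
    rsp -= 8
    memory[rsp] = caller_rbp

    # mov rbp, rsp: the new frame starts here
    rbp = rsp
    return memory, rbp


memory, rbp = build_frame(return_address=0x401234, caller_rbp=0x2000)

saved_base_pointer = memory[rbp]      # [rbp]     -> caller's rbp
return_address = memory[rbp + 8]      # [rbp + 8] -> return address
# Locals such as [rbp - 4] and [rbp - 8] live below rbp and do not
# overlap either of the two saved values above.
```

Anything the function stores at a negative offset from rbp is safe; a write at a non-negative offset lands on the saved base pointer or the return address.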

ARM64: 2 Fast 2 Hard to compile

ARM64 takes a similar approach to x64 when it comes to parameters, with x0 to x7 used as parameter and result registers, and the additional x9 to x15 kept as scratch registers.

The largest difference is that ARM64 tries to keep commonly modified and accessed values in registers at all times. This leads to the Link Register.

The bl (branch with link) instruction writes the return address into the link register (LR, or x30) instead of pushing it onto the stack. This lets ARM64 optimize **Leaf functions** (functions that do not call other functions): they return straight from the link register and forgo the stack altogether.

A function that does call another function must save the link register first, since the nested call overwrites it. The return address and frame pointer are pushed to the stack when setting up the stack frame and popped when tearing it down. This is usually done with two specialized instructions, Store Pair (stp) and Load Pair (ldp), which move both values in one instruction.

Keeping the return address in a register means a leaf function never has to touch memory in order to return. For small functions that are called frequently, skipping that store and load reduces latency and memory traffic, which can in turn reduce power consumption.

For example:

; Leaf function, no need to preserve any values or locations
F_Leaf:
  mov     w0, #5

  ; Return via link register
  ret

F_NotLeaf:
  ; Store x29(Frame Pointer) and x30(Link Register) into stack
  stp     x29, x30, [sp, #-16]!

  ; Setup frame, set frame pointer to current position in stack
  mov     x29, sp

  ; Call leaf function
  bl      F_Leaf

  ; Destroy frame by popping frame pointer and link register
  ; off of the stack
  ldp     x29, x30, [sp], #16
  ret
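The role of the stp/ldp pair can be sketched in a few lines of Python. This models only the link register and a list standing in for the stack; the frame pointer is left out, and the addresses 0x100 and 0x200 are made up for the example.

```python
def call_not_leaf(save_link_register):
    """Model F_NotLeaf calling F_Leaf; returns the address F_NotLeaf returns to."""
    lr = 0x100             # set by the caller's `bl F_NotLeaf`
    stack = []

    if save_link_register:
        stack.append(lr)   # stp x29, x30, [sp, #-16]!  (x29 omitted here)

    lr = 0x200             # `bl F_Leaf` overwrites LR with its own return address
    # F_Leaf runs and executes `ret`, jumping to LR (0x200) without
    # touching the stack at all.

    if save_link_register:
        lr = stack.pop()   # ldp x29, x30, [sp], #16

    return lr              # F_NotLeaf's `ret` jumps to whatever LR holds now


with_save = call_not_leaf(save_link_register=True)
without_save = call_not_leaf(save_link_register=False)
```

With the save and restore, F_NotLeaf returns to its caller at 0x100. Without them, LR still holds 0x200, so F_NotLeaf would "return" into the middle of its own body.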

Scripting Languages: POP Rocks?

Many scripting languages such as Python or Lua take a simple path to dealing with return addresses.

Both Lua and Python push a little chunk of metadata onto an internal queue (used last-in, first-out) for each function call. This is popped off when returning, restoring the previous state.

It is important to note that in both cases there is no connection between the value stack and this specialized internal queue; the queue is managed by the VM, which pushes and pops these values automatically.
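The general idea can be sketched as a toy VM with two separate structures. This is not how Lua or Python are actually implemented; `ToyVM`, `CallFrame`, and their fields are hypothetical names used only to show call metadata living apart from the values.

```python
from dataclasses import dataclass


@dataclass
class CallFrame:
    function_name: str
    return_pc: int     # where to resume in the caller
    stack_base: int    # height of the value stack when the call began


class ToyVM:
    def __init__(self):
        self.values = []   # operands, arguments, and results
        self.frames = []   # call metadata, touched only by the VM itself
        self.pc = 0

    def call(self, function_name, target_pc, arg_count):
        base = len(self.values) - arg_count
        self.frames.append(CallFrame(function_name, self.pc, base))
        self.pc = target_pc

    def ret(self, result):
        frame = self.frames.pop()
        del self.values[frame.stack_base:]   # discard arguments and temporaries
        self.values.append(result)
        self.pc = frame.return_pc            # restore the caller's position


vm = ToyVM()
vm.pc = 7
vm.values += [10, 20]                    # push two arguments
vm.call("add", target_pc=40, arg_count=2)
vm.ret(vm.values[-2] + vm.values[-1])    # callee returns 10 + 20
```

Script code can push and pop `values` freely without ever being able to reach a return address, because those only exist inside `frames`.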

There are a few (minor) issues with this approach:

  • Missed optimizations – Using a stack over primitive variables (virtualized registers) prevents a compiler from potentially optimizing down to using actual hardware registers.
  • Poor memory locality – The call queues are likely allocated separately from the normal stack, so the CPU caches need to fetch parts of the call queue each time a call frame is set up or destroyed.
  • Higher memory usage when using recursive functions
  • Fragmentation for dynamically allocated queues

Although this is a good solution for rewinding stack frames and memory, it can be more fragmented and potentially slower than other solutions.

A Specialized Solution

Since my scripting language is purpose-built, I wanted to go for a hybrid approach that reduces complexity in the bytecode compiler and improves on what general-purpose scripting languages do.

Since values and refs in the script are 32-bit (scripts should not be above 4 GiB in size anyway lol), and recursion depth will be limited, we can use a preallocated stack to hold the return addresses. With 2 KiB allocated and 4 bytes per return address, that allows for a recursion depth of 512 calls!
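The arithmetic, plus the depth check it implies, might look like the Python sketch below. The class and constant names are placeholders for illustration, not the actual implementation, and a plain list stands in for the preallocated memory.

```python
RETURN_STACK_BYTES = 2 * 1024    # 2 KiB, as in the text
ADDRESS_BYTES = 4                # 32-bit return addresses
MAX_DEPTH = RETURN_STACK_BYTES // ADDRESS_BYTES   # 2048 / 4 = 512


class ReturnStack:
    """Fixed-size stack of return addresses (hypothetical sketch)."""

    def __init__(self, max_depth=MAX_DEPTH):
        self.slots = [0] * max_depth   # allocated once, never resized
        self.depth = 0

    def push(self, address):
        if self.depth == len(self.slots):
            raise RecursionError("script call depth exceeded")
        self.slots[self.depth] = address & 0xFFFFFFFF   # keep it 32-bit
        self.depth += 1

    def pop(self):
        if self.depth == 0:
            raise RuntimeError("return with no active call")
        self.depth -= 1
        return self.slots[self.depth]
```

Because the storage never grows, hitting the limit is a cheap bounds check rather than an allocation failure somewhere deep in a recursive script.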

Building on this, I designated the top few KiB of the script's stack as 'system'. This means that return addresses can be pushed here separately, and I can implement checks to ensure that these values cannot be accidentally overwritten or consumed by a misplaced pop.

Separately, parameters are pushed onto the lower portion of the script stack as normal. This means that all of those values are stored close together, while avoiding the pointer offsetting of x64, the special handling of leaf functions on ARM64, and the high memory usage, fragmentation, and poor cache locality of other scripting languages.
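Here is one way the split stack could be sketched in Python. The sizes, the class name `ScriptStack`, and the choice to grow both regions upward are assumptions made for the example; the real layout may differ.

```python
class ScriptStack:
    """One preallocated buffer: data at the bottom, 'system' region at the top."""

    def __init__(self, total_words=2048, system_words=512):
        self.memory = [0] * total_words
        self.system_base = total_words - system_words
        self.sp = 0                     # next free data slot
        self.rsp = self.system_base     # next free return-address slot

    def push(self, value):
        if self.sp == self.system_base:
            raise OverflowError("data stack reached the system region")
        self.memory[self.sp] = value
        self.sp += 1

    def pop(self):
        if self.sp == 0:
            raise IndexError("pop from empty data stack")
        self.sp -= 1
        return self.memory[self.sp]

    def push_return(self, address):
        if self.rsp == len(self.memory):
            raise RecursionError("call depth exceeded")
        self.memory[self.rsp] = address
        self.rsp += 1

    def pop_return(self):
        if self.rsp == self.system_base:
            raise RuntimeError("return with no active call")
        self.rsp -= 1
        return self.memory[self.rsp]


stack = ScriptStack()
stack.push(10)                     # parameters go on the data side
stack.push(20)
stack.push_return(3)               # CALL: return address goes to the system side
total = stack.pop() + stack.pop()
stack.push(total)                  # the result goes on the data side
return_pc = stack.pop_return()     # RETURN: unaffected by the pushed result
```

Compared with the pseudo-bytecode from the introduction, the callee can push its result before returning and a stray pop can never reach a return address, since data pops stop at the bottom of the data region.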

Conclusion

In conclusion, calls and returns are among the most frequently executed instructions in a computer. Your CPU does a great job of making sure they are fast, while scripting languages have a separate purpose for their calling conventions: holding internal state and data, and ensuring no memory can be leaked.

If you are designing a custom language, choose what works best for you. Do you need performance? Do you want better debug info? Should there be lists of arguments, variable counts, etc.?

That's it for now,

Ethan