From 228252e7f9bb035681ef9c4708db37b8393327d9 Mon Sep 17 00:00:00 2001 From: Federico Angelilli Date: Tue, 18 Feb 2025 00:15:25 +0100 Subject: Update draft --- content/posts/pointer-tagging/index.md | 130 +++++++++++++++++++++++---------- 1 file changed, 91 insertions(+), 39 deletions(-) diff --git a/content/posts/pointer-tagging/index.md b/content/posts/pointer-tagging/index.md index 180e120..7f133e8 100644 --- a/content/posts/pointer-tagging/index.md +++ b/content/posts/pointer-tagging/index.md @@ -6,29 +6,38 @@ tags = [ "low-level", "memory", "optimization", "langdev", "interpreter" ] draft = true +++ -Are you curious about the tricks that make some interpreted languages so fast? +Curious about the tricks that make some interpreted languages so fast? Let's analyze and implement **tagged pointers**, a clever optimization -that can reduce memory footprint and boost performance. +that can reduce memory footprint and boost performance in interpreters. -## Beginning with theory +## Some theory -A *pointer* is a word-sized integer representing a *memory address*. -In theory, all the bits of a pointer are used as the whole address space is available. -While this could be the case for older or embedded systems, modern ones leave some bits unused. +Let's start with the term *pointer*, which I'll use to refer to actual **memory addresses**. +Going forward, I will assume these to be represented as word-sized integers. +All possible pointer values (2 raised to the number of bits in a word) form the *virtual address space*. -Let's start from 32-bit architectures (x86, Arm32, etc.). -They use *32-bit words* (they got their name from that), and can address up to 4GiB (4 × 1024³) of memory[^pae]. -Since 32-bit computers can use all of the 4 gigs, every bit in the address will be utilized, right? Well, not so fast. +In principle, all of these addresses are valid and potentially used. 
+While this could be the case for older or embedded systems, modern ones won't allocate certain ranges.
+Consequently, some of the bits in a pointer will be unused (for addressing at least).
+
+First, let's see the behaviour of 32-bit architectures (e.g., x86, Arm32).
+They use *32-bit words* (hence their name), and can address up to 4 GiB (4 × 1024³ bytes) of memory[^pae].
+Since these systems can use all of the 4 gigs, are the bits in the addresses fully utilized? Well, not necessarily.

This is where **alignment** comes into the picture.
-Processors don't like reading from addresses not divisible by their word width.
-These unaligned accesses can cause performance problems (two separate reads are executed) or even trap.
+Processors don't like reading from addresses not divisible by their word size.
+These unaligned accesses can cause performance problems (multiple reads may be executed) or even trap.
+
+To prevent this, compilers insert padding to align values according to their types' alignment (which can be queried with `alignof`).
+These alignment requirements are specified in an *Application Binary Interface* and are thus platform-dependent.
+For example, the current Linux ABI guarantees the stack to be 16-byte aligned[^abi].
+
+Similar conventions are also adopted by system libraries.
+Memory allocated with `malloc` is aligned to 8 bytes (16 on 64-bit systems) by both glibc[^malloc] and MSVCRT[^msalloc].

-To prevent this, compilers insert padding to align values according to their types (you query it with `alignof`).
-Memory allocated with `malloc` is typically aligned to 8 bytes (16 on 64-bit systems)[^malloc].
-In this case, aligned pointers will always have their bottom 3 bits set to zero.
+On a system where these assumptions hold, our pointers will always have their bottom 3 bits set to zero.

```
Top Bottom
@@ -44,9 +53,9 @@ massive (theoretical) address space of 16 EB (16 million terabytes!).
This range is so ludicrously big that processors don't physically support all of it.
Today, the most common configuration is using *48-bit virtual addresses*, leaving the top 16 bits unused.

-Let's see how these bits are handled in various architectures (technical content ahead!).
+Let's see how these bits are handled in common architectures (technical content ahead).

-### X86_64
+### x86_64

Canonical addresses on x86_64 must have bits 63 through 48 set to zeros or ones (depending on whether bit 47 is a zero or one)[^intel].
The Linux kernel uses bit 47 to distinguish between user and kernel address space[^mmap].
@@ -54,19 +63,19 @@ This means that in user space, bits 63:47 will always be zero.
Linux 4.14 added support for 5-level paging, which allows *57-bit virtual addresses*.
Everything said above applies, but for bit 56 instead of 47.
-By default, address space above 47-bit will not be allocated for user space[^page5].
+By default, addresses above the 47-bit range will not be allocated for user space[^page5].

An Intel extension called *Linear Address Masking* ignores the high bits of pointers[^lam], avoiding explicit masking.
-The most significant bit is reserved for kernel use, allowing metadata in bits 62:57 (or 62:48 in LAM_U48 mode).
+The most significant bit is reserved for kernel use, allowing metadata in bits 62:57 (or 62:48 in `LAM_U48` mode).

A similar feature from AMD called *Upper Address Ignore* has not been added to the kernel due to security concerns[^uai].
It allows the uppermost 7 bits to be ignored, including the most significant bit,
which is usually reserved for kernel addresses.

-### Aarch64
+### ARM64

-Aarch64 adopts a slightly different approach.
+The Aarch64 architecture adopts a slightly different approach.
For 48-bit virtual addresses, bits 63:48 must be either all 0s or all 1s.
Address space partitioning is independent of bit 47, allowing for a range of 256 TB (instead of the 128 TB on x86_64).
Kernel addresses have the high bits set to 1s, while user space addresses must have them set to zero[^armptr]. @@ -87,10 +96,10 @@ For example, when using SV48 addressing, bits 63:48 must be equal to bit 47. The virtual memory layout[^rvmap] is similar to x86_64, where bit 47 is set for kernel addresses and the top bits are zero for user space. -The recently ratified *RISC-V Pointer Masking* proposal[^rvext], adds the `Smmpm`, `Smnpm` and `Ssnpm` extensions. +The recently ratified *RISC-V Pointer Masking* proposal[^rvext] adds the `Smmpm`, `Smnpm` and `Ssnpm` extensions. These can be used to configure the processor to ignore `PMLEN` bits in addresses, starting from the most significant bit. -By setting `PMLEN` to 7 the addresses are masked similarly to AMD's UAI. +By setting `PMLEN` to 7, the addresses are masked similarly to AMD's UAI. The security concerns were addressed in a note, specifying that an appropriate system ABI can prevent the usage of the MSB without changes to the masking mechanism. @@ -101,8 +110,9 @@ Let's call the technique that stores metadata in those bits **high bits tagging* For the sake of simplicity, let's select a single layout to act as a common denominator between the various platforms. -To accommodate x86_64 5-level paging and LAM we can choose to use only bits 62:57 for tagging. +To accommodate x86_64 5-level paging and LAM, we can choose to use only bits 62:57 for tagging. These bits are allowed on RISC-V and also fall inside the range of Arm's TBI feature. +I'm also assuming that our addresses are aligned to 16 bytes. Thus, we end up with the following pointer layout (`T` for tag bits). @@ -111,30 +121,44 @@ Top Bottom ▮TTTTTT▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮0000 ``` -## What are the advantages? +## Why is this useful? -Now that we have understood how pointer tagging works, -we need to address :wink: *why* we would want to use it. 
+Now that we have understood the principles behind pointer tagging,
+we need to address :wink: its advantages.
+The stored metadata could be used for many things, including memory safety[^mte2].
+However, I will focus on applications for language development and interpreters.
+
+Before delving into the benefits of tagged pointers, let's see what they are replacing.
+A common approach is to heap-allocate everything as an object and use
+pointers as a uniform value representation. This is what CPython does with `PyObject`.
+The usual alternative is a *tagged union*, which stores a union of the possible value types alongside
+an integer flag. This means fewer heap allocations at the cost of a bigger value type (usually 16 bytes).
+Lua is currently using this approach[^lua].
+
+Is there a way to have our cake and eat it too?
+By tagging our pointers, we can decrease heap allocations and have a
+single word as our value representation.

-## Implementation

-### Union

-## Tagged implementation

-### Low bit
-### High bit

+## Technique comparison
+There are other clever ways to stuff extra information in a pointer.
+A technique called NaN boxing makes use of specific bit configurations of the IEEE 754 double-precision float
+to store pointers (up to 52 bits).
+
## A word of caution

This does not work on funky architectures (e.g., FPGA soft cores, Xeon Phi).
@@ -143,11 +167,23 @@ While there are no standard guarantees about memory alignment,
common architectures like x86 are very consistent.

-## Alternative approaches
-There are other clever ways to stuff extra information in a pointer.
-A technique called NaN boxing makes use specific bit configurations of the IEEE 754 double precision float
-to store pointers (up to 52 bits).
+ +## Implementation + + +### Union + + +## Tagged implementation + +### Low bit + +### High bit + + + + ## References @@ -157,14 +193,19 @@ to store pointers (up to 52 bits). - https://wingolog.org/archives/2011/05/18/value-representation-in-javascript-implementations - https://en.wikichip.org/wiki/arm/tbi - https://mikeash.com/pyblog/friday-qa-2012-07-27-lets-build-tagged-pointers.html +- https://www.snellman.net/blog/archive/2017-09-04-lisp-numbers +- https://alwaysprocessing.blog/2023/03/19/objc-tagged-ptr [^mmap]: https://docs.kernel.org/arch/x86/x86_64/mm.html [^rvmap]: https://docs.kernel.org/arch/riscv/vm-layout.html [^armmap]: https://docs.kernel.org/arch/arm64/memory.html -[^uai]: https://lwn.net/Articles/888914/ +[^uai]: *Pointer tagging for x86 systems*, \ + https://lwn.net/Articles/888914/ + +[^lam]: *Support for Intel's Linear Address Masking*, \ + https://lwn.net/Articles/902094 -[^lam]: https://lwn.net/Articles/902094 [^armptr]: https://www.kernel.org/doc/Documentation/arm64/tagged-pointers.txt [^tbi]: https://source.android.com/docs/security/test/tagged-pointers#top-byte-ignore @@ -182,8 +223,7 @@ to store pointers (up to 52 bits). [^riscv]: *The RISC-V Instruction Set Manual: Volume II*, \ https://drive.google.com/file/d/17GeetSnT5wW3xNuAHI95-SI1gPGd5sJ_/view -[^malloc]: Every libc has some alignment guarantees, for glibc see \ - https://sourceware.org/glibc/wiki/MallocInternals +[^malloc]: https://www.gnu.org/software/libc/manual/html_node/Aligned-Memory-Blocks.html [^pae]: I purposefully didn't mention PAE, see \ https://www.realworldtech.com/forum/?threadid=76912&curpostid=76973 @@ -192,3 +232,15 @@ to store pointers (up to 52 bits). 
https://cdrdv2.intel.com/v1/dl/getContent/671200 [^rvnote]: https://github.com/riscv/riscv-j-extension/blob/master/zjpm/background.adoc#pointer-masking-and-privilege-modes + +[^msalloc]: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/malloc + +[^abi]: *Linux x86 ABI changed; compiler update required*, \ + https://sourceforge.net/p/fbc/bugs/659 + +[^mte2]: *How does MTE work?*, \ + https://developer.arm.com/documentation/108035/0100/How-does-MTE-work-? + +[^v8]: https://v8.dev/blog/pointer-compression#value-tagging-in-v8 + +[^lua]: https://github.com/lua/lua/blob/master/lobject.h -- cgit v1.2.3