Tungsten encodes rows (from datasets and dataframes) in the UnsafeRow format. Initially, I asked Claude Sonnet 4 to explain how this encoding works. I lost over half an hour making sense of what was plain wrong output1 - showing again the dangers of leveraging AI for learning…

Luckily, I found a good article written by a human that clearly explains the rows.

With my new knowledge, I’ll try to answer some of the questions I asked in the previous mail:

How are Dataset encoders different from existing Java encoders?

They encode to and decode from different binary formats (in memory). Tungsten encodes rows into the UnsafeRow datastructure.

What do JVM binary formats (optimized for OOP) and Tungsten binary formats (optimized for data processing) look like for a particular example?

One advantage of Tungsten’s UnsafeRow encoding is that they divide the memory address space in chunks (words) of 8 bytes. This makes it possible to compare raw bytes when, for example, filtering rows. This means that rows don’t have to be deserialized and is thus much faster. Also, because Tungsten manages objects in off-heap memory, it doesn’t depend on the JVM and is thus not negatively impacted by its garbage collection.

Why is Spark’s internal Tungsten format “row-based” and “contiguous”?

Tungsten’s off-heap memory blocks are stored in one single location where are parts are stitched together (contiguous).


  1. The AI thought that the byte-length of the Null bitmap region and the fixed-value region words was 4 bytes. They’re both 8 bytes. ↩︎