Today I scratched the surface of Spark memory management. Spark computes primarily in memory (RAM), spilling to disk only when data doesn't fit. That memory is organized as follows:

RAM
├── JVM Process
│   ├── Heap Memory (On-Heap) ← JVM Objects live here
│   │   ├── Young Generation
│   │   └── Old Generation  
│   ├── Stack Memory
│   └── Method Area
└── Off-Heap Memory ← Tungsten binary data lives here
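
One practical consequence of this split: the off-heap branch isn't used by default. Below is a minimal sketch of how to switch it on. The config keys `spark.memory.offHeap.enabled` and `spark.memory.offHeap.size` are real Spark settings; the app name, master, and sizes are placeholder values.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: enable Tungsten's off-heap allocation alongside the usual
// on-heap (JVM heap) executor memory. Sizes are arbitrary placeholders.
val spark = SparkSession.builder()
  .appName("memory-layout-demo")                  // placeholder name
  .master("local[*]")
  .config("spark.executor.memory", "4g")          // on-heap JVM memory
  .config("spark.memory.offHeap.enabled", "true") // allow off-heap use
  .config("spark.memory.offHeap.size", "2g")      // off-heap budget
  .getOrCreate()
```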

As I dive into how data is represented in these memory regions and how it gets serialized and deserialized (SerDe), I noted a handful of questions that tie the concepts together:

  1. How are Dataset encoders different from Java-serialization-based encoders? (See the first sketch after this list.)
  2. What do JVM binary formats (optimized for OOP) and Tungsten binary formats (optimized for data processing) look like for a particular example?
  3. Why is Spark’s internal Tungsten format “row-based” and “contiguous”?
  4. When data travels between compute nodes, is it converted back into JVM objects for serialization, or does it stay in Tungsten's binary format?
  5. Which coding patterns cause excessive serialization and deserialization, and how can they be avoided? (See the sketch at the end of this post.)
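
To give question 1 something concrete to hang on, here's a minimal sketch with a toy `Person` case class. `Encoders.product` and `Encoders.javaSerialization` are real Spark APIs; my reading so far is that the product encoder maps fields into Spark's internal binary row with a known schema, while the Java-serialization encoder treats the whole object as an opaque blob:

```scala
import org.apache.spark.sql.Encoders

case class Person(name: String, age: Int) // toy example type

// A Dataset (product) encoder knows the schema, so Spark can lay out
// Person as a compact binary row and access fields without deserializing.
val productEncoder = Encoders.product[Person]
println(productEncoder.schema) // name: string, age: int

// The Java-serialization fallback stores Person as an opaque byte blob:
// a single binary column, with no field-level access for the optimizer.
val javaEncoder = Encoders.javaSerialization[Person]
println(javaEncoder.schema)    // a single binary "value" column
```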

There’s a lot to unpack here and I’m going to work through these questions slowly over the coming days (weeks?).
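
And as a down payment on question 5, a hedged sketch of the pattern that keeps coming up in what I've read: typed lambdas force Spark to round-trip each Tungsten row through a JVM object, while the equivalent Column expression stays in the binary format and inside the Catalyst optimizer. The data here is made up; `explain()` should show `DeserializeToObject`/`SerializeFromObject` nodes only in the lambda version.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

case class Person(name: String, age: Int)
val people = Seq(Person("Ana", 34), Person("Bo", 29)).toDS()

// Typed lambda: each Tungsten row is deserialized into a Person object,
// the function runs on the JVM heap, and the result is re-serialized.
val viaLambda = people.map(p => p.age + 1)
viaLambda.explain() // expect DeserializeToObject / SerializeFromObject

// Column expression: stays in Tungsten's binary row format and remains
// visible to the Catalyst optimizer; no per-row object round trip.
val viaColumn = people.select((col("age") + 1).as("age_plus_one"))
viaColumn.explain()
```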