Today I scratched the surface of Spark memory management. Spark does its computation primarily in memory (RAM), spilling to disk only when it must. That memory is organized as follows:
```
RAM
├── JVM Process
│   ├── Heap Memory (On-Heap)  ← JVM objects live here
│   │   ├── Young Generation
│   │   └── Old Generation
│   ├── Stack Memory
│   └── Method Area
└── Off-Heap Memory            ← Tungsten binary data lives here
```
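To make the on-heap/off-heap split concrete, here is a minimal sketch of how Tungsten's off-heap allocation is switched on. The two config keys are real Spark settings; the app name, local master, and 2g size are arbitrary example values:

```scala
import org.apache.spark.sql.SparkSession

// Off-heap is opt-in: Tungsten only allocates outside the GC-managed heap
// when spark.memory.offHeap.enabled is set, and a size must then be given.
val spark = SparkSession.builder()
  .appName("offheap-demo") // placeholder name
  .master("local[*]")      // local mode, just for experimenting
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "2g")
  .getOrCreate()
```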
As I dive into how data is represented in these different memory regions and how it gets serialized and deserialized (SerDe), I noted a few questions that tie the concepts together:
- How are Dataset encoders different from plain Java serialization? (first sketch after this list)
- What do JVM binary formats (optimized for OOP) and Tungsten binary formats (optimized for data processing) look like for a particular example? (second sketch after this list)
- Why is Spark’s internal Tungsten format “row-based” and “contiguous”?
- When data travels between compute nodes, is it first deserialized into JVM objects or does it stay in Tungsten’s binary format?
- Which coding patterns cause excessive serialization and deserialization, and how can I avoid them? (third sketch after this list)
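On the encoder question, a minimal sketch of the contrast as I currently understand it (the `Point` case class is just an example I made up): `Encoders.product` derives a Tungsten-aware encoder from the case class's fields, while `Encoders.javaSerialization` falls back to opaque `java.io.Serializable` blobs.

```scala
import org.apache.spark.sql.Encoders

case class Point(x: Int, y: Int)

// Tungsten-aware encoder: Spark knows the field layout, so it can read
// x and y straight out of the compact binary row without deserializing.
val tungsten = Encoders.product[Point]
println(tungsten.schema) // struct<x:int,y:int>

// Java-serialization encoder: each Point becomes one opaque byte blob,
// so Spark must deserialize the whole object to touch any field.
val javaSer = Encoders.javaSerialization[Point]
println(javaSer.schema) // a single binary column
```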
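For the second question I don't have the byte-level picture yet, but a quick way to at least see which representation a DataFrame row lives in is to drop down to the physical plan's RDD. This reuses the `spark` session from the configuration sketch above; `queryExecution.toRdd` is a developer-facing API:

```scala
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("n", "s")

// The rows of the physical plan are InternalRows, typically UnsafeRow:
// Tungsten's contiguous, row-based binary format.
val internal = df.queryExecution.toRdd.first()
println(internal.getClass.getName) // ...catalyst.expressions.UnsafeRow

// collect(), by contrast, deserializes each row into a JVM Row object.
println(df.collect().head.getClass.getName)
```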
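And for the last question, a sketch of the pattern I expect to matter most, under the same assumptions: a typed lambda forces every row out of Tungsten's format and back, whereas the equivalent Column expression is evaluated directly over the internal binary rows.

```scala
import spark.implicits._

// Typed map: each row is deserialized to a JVM Long, the lambda runs,
// and the result is re-encoded into Tungsten's binary format.
val withSerDe = spark.range(1000).map(_ + 1)

// Column expression: the computation stays inside Tungsten's format,
// so there is no per-row round trip through JVM objects.
val withoutSerDe = spark.range(1000).select($"id" + 1)
```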
There’s a lot to unpack here and I’m going to work through these questions slowly over the coming days (weeks?).