Suppose we have the following flights data in a CSV file:

val schema = """`date` STRING, `delay` INT, `distance` INT,
  `origin` STRING, `destination` STRING"""

val flights_df = spark.read
  .format("csv")
  .option("header", "true")
  .schema(schema)
  .load(csvFile)

The data looks like this:

+--------+-----+--------+------+-----------+
|    date|delay|distance|origin|destination|
+--------+-----+--------+------+-----------+
|01011245|    6|     602|   ABE|        ATL|
|01020600|   -8|     369|   ABE|        DTW|
|01021245|   -2|     602|   ABE|        ATL|
|01020605|   -4|     602|   ABE|        ATL|
|01031245|   -4|     602|   ABE|        ATL|
|01030605|    0|     602|   ABE|        ATL|
|01041243|   10|     602|   ABE|        ATL|
|01040605|   28|     602|   ABE|        ATL|
|01051245|   88|     602|   ABE|        ATL|
|01050605|    9|     602|   ABE|        ATL|
+--------+-----+--------+------+-----------+

There are three ways to write this data to a table in Spark.

1. Temporary View (session-scoped and global)

A temporary view is tied to a single SparkSession within a Spark application. In contrast, a global temporary view is visible across multiple SparkSessions within a Spark application.

flights_df.createOrReplaceTempView("flights_view")

A view is essentially just a named query: no data is actually stored. Only when the view is used in a query is the underlying table’s data loaded into memory.

2. Managed Table

Suppose we define this database in Spark:

spark.sql("CREATE DATABASE flights_db")
spark.sql("USE flights_db")

This table will be managed by Spark and it’s data is written to spark-warehouse/flights_db/flights_managed_tbl as a number of Parquet files:

flights_df.write
  .mode("overwrite")
  .saveAsTable("flights_managed_tbl")

3. External Table

In this case, the data is kept in its original location (the CSV file). Deleting the table therefore doesn’t delete the actual data.

spark.sql(s"""
  CREATE TABLE flights_tbl_csv (
    date STRING,
    delay INT,
    distance INT,
    origin STRING,
    destination STRING
  )
  USING csv
  OPTIONS (
    PATH '${csvFile}',
    header 'true'
  )
""")

Note that the ${csvFile} has to be the absolute path to the data.

Result

This is what the three table’s metadata looks like:

spark.catalog.listTables().show()

+-------------------+----------+-----------+---------+-----------+
|               name|  database|description|tableType|isTemporary|
+-------------------+----------+-----------+---------+-----------+
|flights_managed_tbl|flights_db|       null|  MANAGED|      false|
|    flights_tbl_csv|flights_db|       null| EXTERNAL|      false|
|       flights_view|      null|       null|TEMPORARY|       true|
+-------------------+----------+-----------+---------+-----------+

1. Temporary View (session-scoped and global)¶

2. Managed Table¶

3. External Table¶

Result¶

1. Temporary View (session-scoped and global)

2. Managed Table

3. External Table

Result