Suppose we have the following flights data in a CSV file:
val schema = """`date` STRING, `delay` INT, `distance` INT,
`origin` STRING, `destination` STRING"""
val flights_df = spark.read
.format("csv")
.option("header", "true")
.schema(schema)
.load(csvFile)
The data looks like this:
+--------+-----+--------+------+-----------+
| date|delay|distance|origin|destination|
+--------+-----+--------+------+-----------+
|01011245| 6| 602| ABE| ATL|
|01020600| -8| 369| ABE| DTW|
|01021245| -2| 602| ABE| ATL|
|01020605| -4| 602| ABE| ATL|
|01031245| -4| 602| ABE| ATL|
|01030605| 0| 602| ABE| ATL|
|01041243| 10| 602| ABE| ATL|
|01040605| 28| 602| ABE| ATL|
|01051245| 88| 602| ABE| ATL|
|01050605| 9| 602| ABE| ATL|
+--------+-----+--------+------+-----------+
There are three ways to write this data to a table in Spark.
1. Temporary View (session-scoped and global)
A temporary view is tied to a single SparkSession within a Spark application. In contrast, a global temporary view is visible across multiple SparkSessions within a Spark application.
flights_df.createOrReplaceTempView("flights_view")
A view is essentially just a named query: no data is actually stored. Only when the view is used in a query is the underlying table’s data loaded into memory.
2. Managed Table
Suppose we define this database in Spark:
spark.sql("CREATE DATABASE flights_db")
spark.sql("USE flights_db")
This table will be managed by Spark and it’s data is written to spark-warehouse/flights_db/flights_managed_tbl as a number of Parquet files:
flights_df.write
.mode("overwrite")
.saveAsTable("flights_managed_tbl")
3. External Table
In this case, the data is kept in its original location (the CSV file). Deleting the table therefore doesn’t delete the actual data.
spark.sql(s"""
CREATE TABLE flights_tbl_csv (
date STRING,
delay INT,
distance INT,
origin STRING,
destination STRING
)
USING csv
OPTIONS (
PATH '${csvFile}',
header 'true'
)
""")
Note that the ${csvFile}
has to be the absolute path to the data.
Result
This is what the three table’s metadata looks like:
spark.catalog.listTables().show()
+-------------------+----------+-----------+---------+-----------+
| name| database|description|tableType|isTemporary|
+-------------------+----------+-----------+---------+-----------+
|flights_managed_tbl|flights_db| null| MANAGED| false|
| flights_tbl_csv|flights_db| null| EXTERNAL| false|
| flights_view| null| null|TEMPORARY| true|
+-------------------+----------+-----------+---------+-----------+