Spark SQL and Datasets

Question Click to View Answer

Create a Dork class and make a list of the dorks in the table below. Use the list of Dork objects to make a Dataset of the dorks.

+----------+------------+------------+
|   name   |  birthYear | specialty  |
+----------+------------+------------+
| stallman |       1953 | programmer |
| newton   |       1643 | physics    |
| frink    |       1965 | professor  |
+----------+------------+------------+

A Dork class is created and a list of Dork objects is created. The dorks list is converted to a RDD and then converted to a Dataset with the toDF method.

case class Dork(name: String, birthYear: Int, specialty: String)
val dorks = List(
  Dork("stallman", 1953, "programmer"),
  Dork("newton", 1643, "physics"),
  Dork("frink", 1965, "professor")
)
var dorkDs: org.apache.spark.sql.Dataset[Dork] = spark.createDataset(dorks)

The rest of the questions in this quiz will depend on the dorkDs.

Display the contents of the dorkDs Dataset.

The show method is used to display the contents of a Dataset.

dorkDs.show()

Print the schema of the dorkDs.

dorkDs.printSchema()

Datasets have defined schemas (unlike RDDs, which do not have defined schemas). The defined schemas of Datasets allow them to be queried and joined, similar to tables in a relational database.

Create a youngDorkDs Dataset that contains all the dorks that were born after 1900.

The filter method is used to create a new youngDorkDs Dataset that is a subset of the original dorkDs Dataset.

val youngDorkDs = dorkDs.filter {
  dorkDs("birthYear") > 1900
}

The birthYear column can also be referenced with this syntax:

val youngDorkDs = dorkDs.filter {
  $"birthYear" > 1900
}

Create a Dataset called dorkDs that adds a cool column which contains the dork's name, followed by the string "is cool".

A coolify function takes a string as an argument and returns another string. The coolifyUdf function (udf stands for user defined function) defines a Spark SQL column based function to transform a Dataset.

The withColumn function is then used in conjunction with coolifyUdf to create a new Dataset (called DorkDs) with the column cool.

def coolify(name: String): String = {
  s"$name is cool"
}

val coolifyUdf = udf[String, String](coolify)

val dorkDs1 = dorkDs.withColumn("cool", coolifyUdf($"name"))

dorkDs1.show()

Use the sql method to create a Dataset with the following data:

+----------+
|   name   |
+----------+
| stallman |
| newton   |
| frink    |
+----------+

The createOrReplaceTempView method is used to convert the Dataset to a SQL table that can be queried with Spark SQL. The sqlContext.sql method is used to query the table and return another Dataset.

dorkDs.createOrReplaceTempView("dorks")
val dorkNamesDs = spark.sql("select name from dorks")

The select method can be used to achieve the same result more concisely.

val dorkNamesDs = dorkDs.select("name")

Sort the dorkDs by the name column.

val sortedDorkDs = dorkDs.orderBy("name")

You can achieve the same result the SQL way.

dorkDs.createOrReplaceTempView("dorks")
val sortedDorkDs = spark.sql("select * from dorks order by name")

What type of object does dorkDs.first return? Explain the result.

dorkDs.first returns a Dork object. A Dataset is a collection of objects.

What type of object does dorkDs("name") return? Explain the result.

dorkDs("name") returns a org.apache.spark.sql.Column object.

What does $"whatever" return?

$"whatever" returns a org.apache.spark.sql.ColumnName object.