Spark SQL and Datasets

Question	Click to View Answer
Create a `Dork` class and make a list of the dorks in the table below. Use the list of `Dork` objects to make a Dataset of the dorks. +----------+------------+------------+ \| name \| birthYear \| specialty \| +----------+------------+------------+ \| stallman \| 1953 \| programmer \| \| newton \| 1643 \| physics \| \| frink \| 1965 \| professor \| +----------+------------+------------+	A `Dork` class is created and a list of `Dork` objects is created. The `dorks` list is converted to a RDD and then converted to a Dataset with the `toDF` method. case class Dork(name: String, birthYear: Int, specialty: String) val dorks = List( Dork("stallman", 1953, "programmer"), Dork("newton", 1643, "physics"), Dork("frink", 1965, "professor") ) var dorkDs: org.apache.spark.sql.Dataset[Dork] = spark.createDataset(dorks) The rest of the questions in this quiz will depend on the `dorkDs`.
Display the contents of the `dorkDs` Dataset.	The `show` method is used to display the contents of a Dataset. dorkDs.show()
Print the schema of the `dorkDs`.	dorkDs.printSchema() Datasets have defined schemas (unlike RDDs, which do not have defined schemas). The defined schemas of Datasets allow them to be queried and joined, similar to tables in a relational database.
Create a `youngDorkDs` Dataset that contains all the dorks that were born after 1900.	The `filter` method is used to create a new `youngDorkDs` Dataset that is a subset of the original `dorkDs` Dataset. val youngDorkDs = dorkDs.filter { dorkDs("birthYear") > 1900 } The `birthYear` column can also be referenced with this syntax: val youngDorkDs = dorkDs.filter { $"birthYear" > 1900 }
Create a Dataset called `dorkDs` that adds a `cool` column which contains the dork's name, followed by the string "is cool".	A `coolify` function takes a string as an argument and returns another string. The `coolifyUdf` function (udf stands for user defined function) defines a Spark SQL column based function to transform a Dataset. The `withColumn` function is then used in conjunction with `coolifyUdf` to create a new Dataset (called `DorkDs`) with the column cool. def coolify(name: String): String = { s"$name is cool" } val coolifyUdf = udf[String, String](coolify) val dorkDs1 = dorkDs.withColumn("cool", coolifyUdf($"name")) dorkDs1.show()
Use the `sql` method to create a Dataset with the following data: +----------+ \| name \| +----------+ \| stallman \| \| newton \| \| frink \| +----------+	The `createOrReplaceTempView` method is used to convert the Dataset to a SQL table that can be queried with Spark SQL. The `sqlContext.sql` method is used to query the table and return another Dataset. dorkDs.createOrReplaceTempView("dorks") val dorkNamesDs = spark.sql("select name from dorks") The `select` method can be used to achieve the same result more concisely. val dorkNamesDs = dorkDs.select("name")
Sort the `dorkDs` by the name column.	val sortedDorkDs = dorkDs.orderBy("name") You can achieve the same result the SQL way. dorkDs.createOrReplaceTempView("dorks") val sortedDorkDs = spark.sql("select * from dorks order by name")
What type of object does `dorkDs.first` return? Explain the result.	`dorkDs.first` returns a `Dork` object. A Dataset is a collection of objects.
What type of object does `dorkDs("name")` return? Explain the result.	`dorkDs("name")` returns a `org.apache.spark.sql.Column` object.
What does `$"whatever"` return?	`$"whatever"` returns a `org.apache.spark.sql.ColumnName` object.