Spark RDD Actions

Question Click to View Answer

What does the following code print? What code is executed on the worker nodes and what code is executed on the driver machine.

val numbersRdd = sc.parallelize(List(9, 2))
val squaresRdd = numbersRdd.map { x: Int => x * x }
val squaresArray = squaresRdd.collect()
squaresArray.foreach(println)

The following output is printed.

81
4

The map method is executed on all of the worker nodes in parallel.

The collect method moves all the data from the worker nodes to the driver program as an array. Methods that return results to the driver node like collect are referred to as "actions".

If the collect method is used to send more data to the driver program than it can handle (if the driver program does not have sufficient memory to handle the size of the Array returned by collect), then the program will crash. Use actions with caution!

What does the following code print?

val namesRdd = sc.parallelize(List("luisa", "melissa", "eva"))
val namesCount = namesRdd.count()
println(namesCount)

3 is printed.

The count method is an action that returns all the counts to the driver program.

What does the following code print?

val numbersRdd = sc.parallelize(List(900, 333, 555, 10))
val biggestNumber = numbersRdd.max()
println(biggestNumber)

900 is printed.

The max method is an action that returns the values from the worker nodes to the driver program.

What does the following code print?

val numbersRdd = sc.parallelize(List(4, 99, 2, 348, 99, 1))
val numbers = numbersRdd.take(2)
println(numbers.toList)

List(4, 99) is printed.

The take method returns the first n elements in a RDD.

What does the following code print?

val numbersRdd = sc.parallelize(List(4, 99, 2, 348, 99, 1))
val numbers = numbersRdd.takeOrdered(2)
println(numbers.toList)

List(1, 2) is printed.

The takeOrdered method returns the smallest n elements in a RDD.

What does the following code print?

val numbersRdd = sc.parallelize(List(4, 99, 2, 348, 99, 1))
val numbers = numbersRdd.top(2)
println(numbers.toList)

List(348, 99) is printed.

The top method returns the largest n elements in a RDD.

What does the following code print?

val numbersRdd = sc.parallelize(List(4, 99, 2, 348, 99, 1))
val sum = numbersRdd.fold(0) { (memo: Int, n: Int) =>
  memo + n
}
println(sum)

553 is printed.

fold is a higher order method that first aggregates the results on each partition and then aggregates the results from each partition.

Explain how Spark uses lazy evaluation to execute the following lines of code.

val numbersRdd = sc.parallelize(List(9, 2)) // step 1
val squaresRdd = numbersRdd.map { x: Int => x * x } // step 2
val evenSquares = squaresRdd.filter { x: Int => x % 2 == 0 } // step 3
val squaresArray = squaresRdd.collect() // step 4

RDD transformations are not actually computed until an application calls an action method of a RDD.

In step 1, the RDD isn't even actually created. The instructions for creating the RDD are saved, but the RDD isn't created.

In step 2, the squaresRdd is not created.

In step 3, the evenSquares isn't created. All of these computations are put off as long as possible - until an action method is called.

In step 4, the action method collect is called, so all of the computations are made. Lazy evaluation allows Scala to optimize the computations before they are made.