Spark RDD Transformations

Question Click to View Answer

What does the following code print?

val mammals = List("Lion", "Dolphin", "Whale")
val mammalsRdd = sc.parallelize(mammals)
val mammalsLengthRdd = mammalsRdd.map { (m: String) =>
  m.length
}
mammalsLengthRdd.collect().foreach(println)

4 5 7 is printed.

sc is the Spark context variable that is set when the Spark shell is initialized.

The parallelize method is used to make a Resilient Distributed Dataset (RDD). RDDs are a collection of partitioned elements that can be operated on in parallel.

The map method is used to iterate over the RDD and generate a new RDD that contains the length of each string.

The collect and foreach methods are used to print the results of the RDD.

What does the following code print?

val languages = List("spanish", "french", "farsi")
val languagesRdd = sc.parallelize(languages)
val someLanguagesRdd = languagesRdd.filter { (l: String) =>
  l.take(1) != "f"
}
someLanguagesRdd.collect().foreach(println)

spanish is printed.

The languagesRdd contains three languages.

The filter method is used to create an RDD that only includes the languages that don't start with the letter f.

What does the following code print?

val numbers = List("one", "two")
val letters = List("a", "b")
val numbersRdd = sc.parallelize(numbers)
val lettersRdd = sc.parallelize(letters)
val both = numbers.union(letters)
println(both)

List(one, two, a, b) is printed.

The union method is used to create a list that contains all the elements in both of the RDDs.

What does the following code print?

val numbersRdd = sc.parallelize(List(1, 2, 3, 4, 3, 2, 1))
val uniqueNumbersRdd = numbersRdd.distinct
uniqueNumbersRdd.collect().foreach(println)

4 1 2 3 is printed.

The distinct method returns a new RDD that returns all the unique numbers in the RDD.

What does the following code print?

val numbersRdd = sc.parallelize(List(10, 20))
val lettersRdd = sc.parallelize(List("a", "b"))
val zippedPairsRdd = numbersRdd.zip(lettersRdd)
zippedPairsRdd.collect().foreach(println)

The following lines are printed:

(10,a)
(20,b)

The zip method combines two RDDs into a single RDD with pairs that correspond to matching elements in each individual RDD.

What does the following code print?

case class Person(name: String, age: Int)
val bob = new Person("bob", 40)
val mario = new Person("mario", 40)
val britney = new Person("britney", 16)
val peopleRdd = sc.parallelize(List(bob, mario, britney))
val inDaClubRdd = peopleRdd.groupBy { p => {
    p.age
  }
}
inDaClubRdd.collect().foreach(println)

The following is printed:

(16,CompactBuffer(Person(britney,16)))
(40,CompactBuffer(Person(bob,40), Person(mario,40)))

An RDD of Person instances is created. The groupBy method is used to group the people by their age. The two 40 year olds are grouped together and the 16 year old is in a separate group.

What does the following code print?

val numbersRdd = sc.parallelize(List(8, 10, 2))
val sortedRdd = numbersRdd.sortBy(x => x, true)
sortedRdd.collect().foreach(println)

2 8 10 is printed.

The sortBy method is used to sort all the numbers in the numbersRdd. true is passed as the second argument to sortedRdd to indicate that the numbers should be sorted in ascending order.

What does the following code print?

val pairRdd1 = sc.parallelize(List(("a", 1), ("b",2), ("c",3)))
val pairRdd2 = sc.parallelize(List(("b", "second"), ("c","third"), ("d","fourth")))
val joinRdd = pairRdd1.join(pairRdd2)
joinRdd.collect().foreach(println)

The following output is printed:

(b,(2,second))
(c,(3,third))

The join method does an inner join of the two RDDs.

What does the following code print?

val pairRdd1 = sc.parallelize(List(("a", 1), ("b",2), ("c",3)))
val pairRdd2 = sc.parallelize(List(("b", "second"), ("c","third"), ("d","fourth")))
val leftOuterJoinRdd = pairRdd1.leftOuterJoin(pairRdd2)
leftOuterJoinRdd.collect().foreach(println)

The following output is printed:

(a,(1,None))
(b,(2,Some(second)))
(c,(3,Some(third)))

The leftOuterJoin method is used to perform a left outer join of the two RDDs.