Spark RDD Transformations

Question	Answer
What does the following code print? `val mammals = List("Lion", "Dolphin", "Whale"); val mammalsRdd = sc.parallelize(mammals); val mammalsLengthRdd = mammalsRdd.map { (m: String) => m.length }; mammalsLengthRdd.collect().foreach(println)`	`4 7 5` is printed (one number per line). `sc` is the Spark context variable that is set when the Spark shell is initialized. The `parallelize` method creates a Resilient Distributed Dataset (RDD) from a local collection. An RDD is a partitioned collection of elements that can be operated on in parallel. The `map` method applies a function to every element and returns a new RDD, here one that contains the length of each string. The `collect` method returns the elements of the RDD to the driver as an array, and `foreach(println)` prints each element.
What does the following code print? `val languages = List("spanish", "french", "farsi"); val languagesRdd = sc.parallelize(languages); val someLanguagesRdd = languagesRdd.filter { (l: String) => l.take(1) != "f" }; someLanguagesRdd.collect().foreach(println)`	`spanish` is printed. The `languagesRdd` contains three languages. The `filter` method creates a new RDD that keeps only the elements for which the function returns `true`, here the languages that don't start with the letter f.
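What does the following code print? `val numbers = List("one", "two"); val letters = List("a", "b"); val numbersRdd = sc.parallelize(numbers); val lettersRdd = sc.parallelize(letters); val both = numbers.union(letters); println(both)`	`List(one, two, a, b)` is printed. Note that `union` is called on the Scala lists `numbers` and `letters`, not on the RDDs, so `both` is an ordinary `List` and the two RDDs are never used. Calling `numbersRdd.union(lettersRdd)` instead would return a new RDD that contains all the elements of both RDDs.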
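What does the following code print? `val numbersRdd = sc.parallelize(List(1, 2, 3, 4, 3, 2, 1)); val uniqueNumbersRdd = numbersRdd.distinct; uniqueNumbersRdd.collect().foreach(println)`	The numbers 1, 2, 3, and 4 are each printed once, one per line. The order depends on how the data is partitioned, so the output may look like `4 1 2 3`. The `distinct` method returns a new RDD that contains only the unique elements of the original RDD.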
What does the following code print? `val numbersRdd = sc.parallelize(List(10, 20)); val lettersRdd = sc.parallelize(List("a", "b")); val zippedPairsRdd = numbersRdd.zip(lettersRdd); zippedPairsRdd.collect().foreach(println)`	The following lines are printed: `(10,a)` `(20,b)` The `zip` method combines two RDDs into a single RDD of pairs, matching the first element of one RDD with the first element of the other, the second with the second, and so on. Both RDDs must have the same number of partitions and the same number of elements in each partition.
What does the following code print? `case class Person(name: String, age: Int); val bob = new Person("bob", 40); val mario = new Person("mario", 40); val britney = new Person("britney", 16); val peopleRdd = sc.parallelize(List(bob, mario, britney)); val inDaClubRdd = peopleRdd.groupBy { p => { p.age } }; inDaClubRdd.collect().foreach(println)`	The following is printed (the order of the groups may vary): `(16,CompactBuffer(Person(britney,16)))` `(40,CompactBuffer(Person(bob,40), Person(mario,40)))` An RDD of `Person` instances is created. The `groupBy` method groups the people by their age and returns an RDD of (age, people) pairs. The two 40-year-olds are grouped together and the 16-year-old is in a separate group.
What does the following code print? `val numbersRdd = sc.parallelize(List(8, 10, 2)); val sortedRdd = numbersRdd.sortBy(x => x, true); sortedRdd.collect().foreach(println)`	`2 8 10` is printed (one number per line). The `sortBy` method sorts the numbers in `numbersRdd` by the key that the first argument computes, here the number itself. `true` is passed as the second argument to `sortBy` to indicate that the numbers should be sorted in ascending order.
What does the following code print? `val pairRdd1 = sc.parallelize(List(("a", 1), ("b",2), ("c",3))); val pairRdd2 = sc.parallelize(List(("b", "second"), ("c","third"), ("d","fourth"))); val joinRdd = pairRdd1.join(pairRdd2); joinRdd.collect().foreach(println)`	The following output is printed (the order of the lines may vary): `(b,(2,second))` `(c,(3,third))` The `join` method performs an inner join of the two RDDs on their keys. Only the keys `b` and `c` appear in both RDDs, so `a` and `d` are dropped.
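What does the following code print? `val pairRdd1 = sc.parallelize(List(("a", 1), ("b",2), ("c",3))); val pairRdd2 = sc.parallelize(List(("b", "second"), ("c","third"), ("d","fourth"))); val leftOuterJoinRdd = pairRdd1.leftOuterJoin(pairRdd2); leftOuterJoinRdd.collect().foreach(println)`	The following output is printed (the order of the lines may vary): `(a,(1,None))` `(b,(2,Some(second)))` `(c,(3,Some(third)))` The `leftOuterJoin` method performs a left outer join of the two RDDs. Every key from `pairRdd1` is kept, and the value from `pairRdd2` is wrapped in an `Option`: `Some` when the key has a match and `None` when it does not. The key `d` is dropped because it appears only in `pairRdd2`.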
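
Several answers above depend on whether a transformation preserves element order. The following is a minimal sketch, not part of the original quiz, that runs outside the Spark shell. It assumes Spark core is on the classpath and builds its own local `SparkContext` in place of the shell's `sc`. The object name `RddSketch` and its helper methods are hypothetical names chosen for illustration. It shows `union` called on the RDDs themselves, and sorts the results of `distinct` and `join` so the printed order is repeatable.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object; the names are illustrative only.
object RddSketch {
  // RDD-level union: concatenates the two RDDs and does not remove duplicates.
  def unionRdds(sc: SparkContext, left: List[String], right: List[String]): Array[String] =
    sc.parallelize(left).union(sc.parallelize(right)).collect()

  // distinct shuffles the data, so sort on the driver to get a repeatable order.
  def sortedDistinct(sc: SparkContext, xs: List[Int]): Array[Int] =
    sc.parallelize(xs).distinct().collect().sorted

  // Inner join on the key, then sortByKey so collect returns rows in key order.
  def sortedJoin(
      sc: SparkContext,
      left: List[(String, Int)],
      right: List[(String, String)]): Array[(String, (Int, String))] =
    sc.parallelize(left).join(sc.parallelize(right)).sortByKey().collect()

  def main(args: Array[String]): Unit = {
    // Assumption: a local two-thread master stands in for a real cluster.
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(unionRdds(sc, List("one", "two"), List("a", "b")).mkString(", "))
      // one, two, a, b

      println(sortedDistinct(sc, List(1, 2, 3, 4, 3, 2, 1)).mkString(" "))
      // 1 2 3 4

      sortedJoin(
        sc,
        List(("a", 1), ("b", 2), ("c", 3)),
        List(("b", "second"), ("c", "third"), ("d", "fourth"))
      ).foreach(println)
      // (b,(2,second))
      // (c,(3,third))
    } finally {
      sc.stop()
    }
  }
}
```

Sorting after `collect` is reasonable only for small results like these. For larger data, sorting on the cluster with `sortBy` or `sortByKey` before collecting is usually the better choice.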