Often when reading in a file, we want to work with the individual values contained in each line, separated by some delimiter. Splitting a delimited line is a trivial operation:
val newRDD = textRDD.map(line => line.split(","))
But the issue here is that the returned RDD holds one array per line: each element is an Array[String], not an individual string. What we want is the individual values obtained after calling the split function. In other words, we need an RDD[String] (an Array[String] once collected), not an RDD[Array[String]]. For this we would use the flatMap function. For those with a functional programming background, using a flatMap operation is nothing new. But if you are new to functional programming, it's a great operation to become familiar with.
val inputData = sc.parallelize(Array("foo,bar,baz", "larry,moe,curly", "one,two,three")).cache()
val mapped = inputData.map(line => line.split(","))
val flatMapped = inputData.flatMap(line => line.split(","))
val mappedResults = mapped.collect()
val flatMappedResults = flatMapped.collect()
println("Mapped results of split")
println(mappedResults.mkString(" : "))
println("FlatMapped results of split")
println(flatMappedResults.mkString(" : "))
When we run the program we see these results:
Mapped results of split
[Ljava.lang.String;@45e22def : [Ljava.lang.String;@6ae3fb94 : [Ljava.lang.String;@4417af13
FlatMapped results of split
foo : bar : baz : larry : moe : curly : one : two : three
As we can see, the map example returned an Array containing 3 Array[String] instances (printed in the JVM's default array form, a type tag followed by a hash code), while the flatMap call returned the individual values contained in one Array.
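The same contrast can be reproduced without a cluster, since ordinary Scala collections offer map and flatMap with the same semantics. The following is only a minimal sketch under that assumption: the object name FlatMapSketch and the value names are made up for illustration, and a plain Seq stands in for the RDD.

```scala
// Illustrative sketch: a plain Scala Seq stands in for the RDD,
// so no SparkContext is required. All names here are hypothetical.
object FlatMapSketch {
  val lines: Seq[String] = Seq("foo,bar,baz", "larry,moe,curly", "one,two,three")

  // map produces exactly one result per input line: three arrays.
  val mapped: Seq[Array[String]] = lines.map(line => line.split(","))

  // flatMap concatenates the per-line results: nine strings.
  val flatMapped: Seq[String] = lines.flatMap(line => line.split(",").toSeq)

  // flatMap behaves like map followed by flattening one level.
  val mapThenFlatten: Seq[String] = mapped.map(_.toSeq).flatten

  def main(args: Array[String]): Unit = {
    // Render each inner array explicitly to avoid the default [Ljava.lang.String;@... form.
    println(mapped.map(_.mkString("[", ", ", "]")).mkString(" : "))
    println(flatMapped.mkString(" : "))
  }
}
```

Running main prints the three bracketed groups on the first line and the nine individual values on the second, which makes the extra level of nesting produced by map easy to see.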