Serialization

Serialization plays an important role in the performance of any distributed application. Formats that are slow to serialize objects into, or consume a large number of bytes, will greatly slow down the computation. Often, this will be the first thing you should tune to optimize a Spark application. The Java default serializer has very mediocre performance regarding runtime as well as the size of its results. Therefore the Spark team recommends to use the Kryo serializerarrow-up-right instead.

The following code shows an example of how you can turn on Kryo and how you can register the classes that you will be serializing:

 val conf = new SparkConf().setAppName(...).setMaster(...)
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.max", "128m")
      .set("spark.kryoserializer.buffer", "64m")
      .registerKryoClasses(
        Array(classOf[ArrayBuffer[String]], classOf[ListBuffer[String]])
      )

You can find another example here BasicAvgWithKryoarrow-up-right

Last updated

Was this helpful?