Don't use count() when you don't need to return the exact number of rows

When you only need to know whether a DataFrame is empty, not its exact row count, use:

DataFrame inputJson = sqlContext.read().json(...);
if (inputJson.takeAsList(1).size() == 0) {...}

or

if (inputJson.queryExecution().toRdd().isEmpty()) {...}

instead of:

if (inputJson.count() == 0) {...}

On an RDD you can call isEmpty() directly, because its implementation only tries to fetch a single element instead of counting them all:

def isEmpty(): Boolean = withScope { 
    partitions.length == 0 || take(1).length == 0 
}
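
To make the cost difference concrete, here is a minimal plain-Java sketch. It is an analogy only and does not use Spark: `EmptinessCheckSketch`, `CountingSource`, `isEmptyByCount`, and `isEmptyByTakeOne` are hypothetical names introduced for illustration. It counts how many rows each style of check pulls from a source before it can answer "is this empty?".

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Plain-Java analogy (no Spark): measures how many rows an emptiness check reads. */
final class EmptinessCheckSketch {

    /** Hypothetical row source that records how many rows have been read from it. */
    static final class CountingSource implements Iterator<Integer> {
        private final int totalRows;
        private int rowsRead = 0;

        CountingSource(int totalRows) {
            this.totalRows = totalRows;
        }

        @Override
        public boolean hasNext() {
            return rowsRead < totalRows;
        }

        @Override
        public Integer next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return rowsRead++;
        }

        int rowsRead() {
            return rowsRead;
        }
    }

    /** count()-style check: walks every row just to compare the total with zero. */
    static boolean isEmptyByCount(Iterator<?> rows) {
        long count = 0;
        while (rows.hasNext()) {
            rows.next();
            count++;
        }
        return count == 0;
    }

    /** take(1)-style check: stops as soon as a single row has been seen. */
    static boolean isEmptyByTakeOne(Iterator<?> rows) {
        if (rows.hasNext()) {
            rows.next();
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        CountingSource forCount = new CountingSource(1_000_000);
        CountingSource forTakeOne = new CountingSource(1_000_000);

        // Both checks give the same answer; only the number of rows read differs.
        System.out.println(isEmptyByCount(forCount) + ", rows read: " + forCount.rowsRead());
        System.out.println(isEmptyByTakeOne(forTakeOne) + ", rows read: " + forTakeOne.rowsRead());
    }
}
```

Both checks return the same result, but the count-style check reads the whole source while the take(1)-style check reads at most one row. In Spark the saving is analogous, not identical: take(1) still has to launch a job and may scan more than one partition if the first ones are empty.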


