Optimising a PySpark job today. Reduced the run time from ~90 mins to 15 mins on the same cluster by rewriting it as a classic MapReduce job rather than SparkSQL queries.

While the DataFrame API has advantages in readability, it is difficult to reason about its performance trade-offs.
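For illustration only: assuming the job was a per-key aggregation (the original doesn't say), the "classic MapReduce" shape is a map phase emitting key/value pairs followed by a per-key reduce, which in PySpark would be `rdd.map(...).reduceByKey(...)`. A plain-Python sketch of that shape, with toy data standing in for the real records:

```python
from collections import defaultdict
from functools import reduce

# Toy (key, value) records standing in for rows a SparkSQL
# GROUP BY ... SUM(...) would aggregate.
records = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

# Map phase: transform each record into a (key, value) pair.
# Identity here; a real job would parse/project each record.
mapped = map(lambda kv: (kv[0], kv[1]), records)

# Shuffle: group values by key.
groups = defaultdict(list)
for k, v in mapped:
    groups[k].append(v)

# Reduce phase: fold values per key, mirroring reduceByKey(add).
result = {k: reduce(lambda a, b: a + b, vs) for k, vs in groups.items()}
# result == {"a": 9, "b": 6}
```

The point of the RDD-style rewrite is that each stage (map, shuffle, reduce) is explicit in the code, so you can see where data movement happens, whereas with SparkSQL the optimiser decides the physical plan for you.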
