Parse XML with PySpark in Databricks

It's kind of a trick title, but here's the answer: don't. Just don't do it. Python is no good here - you might as well drop into Scala for this one [edit: foreach/foreachBatch should actually be pretty good here - I'll add a sample later].

My issue was that I needed to parse XML that's coming in through an Event Hub stream, not from a file. The library that Databricks wrote for XML parsing (spark-xml) is built around reading directly from files, so parsing XML strings that already sit in a DataFrame column is a little trickier than you'd think.

I found this little snippet online somewhere.


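import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import com.databricks.spark.xml.XmlReader
import spark.implicits._  // encoder needed by .as[String]

// eventHubDf is a placeholder name: the original snippet left out the source
// DataFrame (the Event Hub read whose binary body column holds the XML)
val stream = eventHubDf
  .selectExpr("CAST(body AS STRING)").as[String]  // cast the binary body to the XML string

// parse the RDD of XML strings with spark-xml  <-- This is the magic line
val df = (new XmlReader()).xmlRdd(sqlContext, stream.rdd)
  .withColumn("user", explode(col("users")))  // one row per user
  .withColumn("action", explode(col("user.actions")))  // one row per action
  .withColumn("action_name", col("action._actionName"))  // spark-xml prefixes attributes with _ by default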

As noted above, the magic line is really (new XmlReader()).xmlRdd(sqlContext, stream.rdd), which exposes the XML strings as an RDD and hands that RDD to the XmlReader to parse, instead of pointing the reader at a file. Fantastic. Forgive any atrocious Scala.
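
For the Python route mentioned in the edit at the top, here is a rough sketch of what a foreachBatch version might look like. It is not the spark-xml approach from the Scala snippet: it parses each body with Python's built-in xml.etree.ElementTree instead. The sample payload, the element and attribute names (guessed from the users / actions / _actionName columns), the parsed_actions table name, and the event_hub_stream variable are all assumptions for illustration. Inside foreachBatch every micro-batch arrives as a regular DataFrame, so .rdd is available there.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload; the element and attribute names are guesses inferred
# from the column names in the Scala snippet (users, actions, _actionName).
SAMPLE_BODY = b"""<event>
  <users>
    <actions actionName="login"/>
    <actions actionName="logout"/>
  </users>
  <users>
    <actions actionName="purchase"/>
  </users>
</event>
"""


def parse_actions(body):
    """Return one (action_name,) tuple per <actions> element under <users>."""
    root = ET.fromstring(body)  # accepts bytes or str
    return [
        (action.get("actionName"),)
        for user in root.iter("users")
        for action in user.iter("actions")
    ]


def process_batch(batch_df, batch_id):
    """foreachBatch callback: batch_df is an ordinary, non-streaming DataFrame."""
    parsed = (
        batch_df.select("body").rdd
        .flatMap(lambda row: parse_actions(bytes(row["body"])))
        .toDF("action_name string")  # explicit schema, so empty batches still work
    )
    # "parsed_actions" is a made-up table name; write wherever you need to.
    parsed.write.mode("append").saveAsTable("parsed_actions")


# Wiring it up (not run here; event_hub_stream is a placeholder for the
# streaming DataFrame read from Event Hub):
# event_hub_stream.writeStream.foreachBatch(process_batch).start()

# Quick local check of the parsing logic, no Spark needed.
print(parse_actions(SAMPLE_BODY))
```

The parsing lives in a plain function so it can be checked without a cluster. Note that a malformed body makes ET.fromstring raise a ParseError, which would fail the whole micro-batch, so real code would probably want a try/except around it.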