Read from SQL Server with Python/PySpark in Databricks
This is actually really easy, but it's not spelled out explicitly in the Databricks docs, though it is mentioned in the Spark docs. Alas, SQL Server always seems to be a special case, so I tend to discount instructions unless they mention SQL Server explicitly. Not this time!
I'm guessing it's about as easy outside of Databricks if you're just running PySpark. You'll just need to follow these docs and install the proper library, com.microsoft.azure:azure-sqldb-spark.
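If you go that route outside Databricks, attaching the connector when you build your own session might look like the sketch below. spark.jars.packages is standard Spark config; the version number is an assumption, so check Maven Central for the current release.

from pyspark.sql import SparkSession

# Pull the connector from Maven when the session starts. The version here
# is an assumption; check Maven Central for the current release.
spark = (SparkSession.builder
    .appName("sqlserver-read")
    .config("spark.jars.packages", "com.microsoft.azure:azure-sqldb-spark:1.0.2")
    .getOrCreate())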
But if you're using Databricks already, you don't even have to do that. Admittedly the performance isn't great, but that could be due to a thousand other factors that I have not yet looked into.
# Build the JDBC URL; credentials come from a Databricks secret scope
# instead of being hardcoded in the notebook.
jdbc_url = "jdbc:sqlserver://{host}:{port};database={database};user={user};password={password};UseNTLMv2=true".format(
    host=host, port=port, database=database,
    user=dbutils.secrets.get('database', 'username'),
    password=dbutils.secrets.get('database', 'password'))

# Read the table over JDBC; "dbtable" is the documented option name.
df = (spark.read
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "MyTable")
    .load()
)
display(df)
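On the performance point: one standard lever for Spark JDBC reads is partitioning, which splits the table across executors instead of pulling everything through a single connection. A minimal sketch, assuming MyTable has a numeric Id column with roughly known bounds (both assumptions on my part):

# A sketch of a partitioned JDBC read. "Id" and its bounds are assumptions
# about MyTable; the four options below are standard Spark JDBC options.
df = (spark.read
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "MyTable")
    .option("partitionColumn", "Id")  # numeric/date column to split on
    .option("lowerBound", "1")        # min of the split range
    .option("upperBound", "1000000")  # max of the split range
    .option("numPartitions", "8")     # parallel connections/tasks
    .load()
)

Note that lowerBound and upperBound only control how the range is carved up, not a filter; rows outside the range still come back in the edge partitions.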