TIP: Connecting to Azure blob from Apache Spark on Linux DSVM


  • Background: Since December 2016, the Linux DSVM has come with a standalone instance of Apache Spark built in to support local development before deploying to large Spark clusters on Azure HDInsight or on-premises. HDInsight Spark by default uses Azure blob storage (or Azure Data Lake Store) as the backing store. So it is convenient to be able to develop on the Linux DSVM with your data in Azure blob storage, so that you can verify your code fully before deploying it to large Spark clusters on Azure HDInsight. 

    Here are the one-time setup steps to start using Azure blob storage from your Spark program. (Ensure you run these commands as root.)

    cd $SPARK_HOME/conf
    cp spark-defaults.conf.template spark-defaults.conf
    cat >> spark-defaults.conf <<EOF
    spark.jars                 /dsvm/tools/spark/current/jars/azure-storage-4.4.0.jar,/dsvm/tools/spark/current/jars/hadoop-azure-2.7.3.jar
    EOF

    If you don't have a core-site.xml in the $SPARK_HOME/conf directory, run the following:

    cat >> core-site.xml <<EOF
    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>fs.AbstractFileSystem.wasb.impl</name>
        <value>org.apache.hadoop.fs.azure.Wasb</value>
      </property>
      <property>
        <name>fs.azure.account.key.YOURSTORAGEACCOUNT.blob.core.windows.net</name>
        <value>YOURSTORAGEACCOUNTKEY</value>
      </property>
    </configuration>
    EOF

    Otherwise, just copy the two <property> sections above into your existing core-site.xml file. Replace YOURSTORAGEACCOUNT and YOURSTORAGEACCOUNTKEY with the actual name and access key of your Azure storage account. 

    Once you have done these steps, you should be able to access blobs from your Spark program using a wasb://YourContainer@YOURSTORAGEACCOUNT.blob.core.windows.net/YourBlob URL in the read API. 
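    As an illustration, here is a minimal PySpark sketch of the read step described above. The container, account, and blob names are placeholders taken from the URL pattern in the tip, and the CSV format is just an assumption for the example:

    ```python
    # Build the wasb:// URL from its parts (placeholder names, as in the tip above).
    storage_account = "YOURSTORAGEACCOUNT"
    container = "YourContainer"
    blob_path = "YourBlob"

    wasb_url = f"wasb://{container}@{storage_account}.blob.core.windows.net/{blob_path}"
    print(wasb_url)  # wasb://YourContainer@YOURSTORAGEACCOUNT.blob.core.windows.net/YourBlob

    # With the jars and core-site.xml configured as above, an existing
    # SparkSession can read the blob like any other path, e.g.:
    #
    #   from pyspark.sql import SparkSession
    #   spark = SparkSession.builder.getOrCreate()
    #   df = spark.read.csv(wasb_url, header=True)
    ```

    The same URL works with any of the Spark read APIs (text, parquet, json, and so on), since wasb is registered as a Hadoop filesystem scheme by the configuration above.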

    Thanks to Alberto De Marco for the tip. He has also published a blog post on this. 

    Friday, March 17, 2017 8:12 PM