Develop a Batch Processing Solution Using Azure Databricks – Create and Manage Batch Processing and Pipelines-1

  1. Log in to the Azure portal at https://portal.azure.com ➢ navigate to the Azure Databricks workspace you created in Exercise 3.14 ➢ click the Launch Workspace button on the Overview blade ➢ select Compute from the Azure Databricks Workspace navigation menu ➢ start a compute cluster ➢ select Workspace from the Azure Databricks Workspace navigation menu ➢ select Users ➢ select your account ID ➢ select the drop‐down arrow from the pop‐out menu next to your account ID ➢ select Create ➢ select Notebook and provide a name (I used IdentifyBrainwaveScenario) ➢ select Python from the Default Language drop‐down ➢ select the Apache Spark cluster you started earlier from the Cluster drop‐down ➢ and then click Create.
  2. Enter the following code into the first cell. The Jupyter/IPython file named IdentifyBrainwaveScenario.ipynb is in the Chapter06/Ch06Ex04 directory on GitHub at https://github.com/benperk/ADE. Edit the readPath value to point to the location where the files were written in Exercise 6.3.
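The authoritative cell is in IdentifyBrainwaveScenario.ipynb on GitHub; the snippet below is only a hedged sketch of its opening shape, not the book's exact code. The `readPath` value and the container/account names are placeholders you must replace with your own, and the built-in `avro` data source is assumed (it ships with Databricks Runtime).

```python
# Hedged sketch only; the authoritative cell is in IdentifyBrainwaveScenario.ipynb.
# readPath is a placeholder -- set it to wherever Exercise 6.3 wrote its Avro files.
readPath = "wasbs://<container>@<account>.blob.core.windows.net/<exercise-6-3-output>/"

def load_exercise_output(spark, path=readPath):
    """Read the Avro files produced in Exercise 6.3 into a Spark DataFrame."""
    # The "avro" data source is bundled with Databricks Runtime clusters.
    return spark.read.format("avro").load(path)
```

On the cluster you started in step 1, `df = load_exercise_output(spark)` would return the DataFrame that the notebook's later cells transform.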
  3. Create a new cell and enter the following code:
    df.write.format("delta").mode("overwrite") \
        .saveAsTable("default.identifiedAs" + scenario)
    display(spark.sql("SELECT * FROM default.identifiedAs" + scenario))
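Note how the cell builds the Delta table name by appending the `scenario` string (computed earlier in the notebook) to a fixed `default.identifiedAs` prefix, and the same concatenation feeds the follow-up query. With a hypothetical scenario value:

```python
# "PlayingGuitar" is a hypothetical value; the notebook derives scenario from the data.
scenario = "PlayingGuitar"

table_name = "default.identifiedAs" + scenario
query = "SELECT * FROM " + table_name

print(table_name)  # default.identifiedAsPlayingGuitar
```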
  4. Return to the Azure portal at https://portal.azure.com ➢ navigate to the Azure Synapse Analytics workspace you created in Exercise 3.3 ➢ click the Open link in the Open Synapse Studio tile on the Overview blade ➢ select the Integrate hub ➢ select the pipeline you created in Exercise 6.2 (for example, TransformSessionFrequencyToMedian) ➢ expand the Databricks Activity group ➢ drag and drop a Notebook activity onto the editor canvas ➢ enter a name (I used Identify Brainwave Scenario) ➢ drag the green box connector from the To Avro activity to the Notebook activity ➢ select the Azure Databricks tab ➢ select + New to the right of the Databricks Linked Service drop‐down list box ➢ enter a name (I used BrainjammerAzureDatabricks) ➢ enable interactive authoring ➢ select the Azure subscription, then the Databricks workspace you created in Exercise 3.14 and/or the one you worked with in step 1 of this exercise ➢ and then ensure the New Job Cluster radio button is selected.
  5. Return to your Azure Databricks workspace ➢ select Settings ➢ select User Settings ➢ click Generate New Token ➢ enter a comment ➢ click Generate ➢ copy the token ➢ click Done ➢ after loading completes, paste the token into the Access Token text box of the Azure Databricks linked service you are creating in Azure Synapse Analytics ➢ select version 10.2 or 11.3 from the Cluster Version drop‐down ➢ select Standard_F4 from the Cluster Node Type drop‐down ➢ select 3 from the Python Version drop‐down ➢ expand the Additional Cluster Settings group ➢ and then add your storage account details to the Name and Value fields in the Additional Cluster Settings section. (You did something similar in Exercise 5.4.) The configuration should resemble Figure 6.17.
    Name: fs.azure.account.key..blob.core.windows.net
    Value: F98yw7on7……==
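The Name shown above follows the standard Hadoop configuration key for an Azure Blob Storage account key, with your storage account name between the two dots; the Value is that account's access key. A small sketch with placeholder values:

```python
# Placeholders -- substitute your own storage account name and access key.
storage_account = "<storageAccountName>"
access_key = "<storageAccountKey>"

# Spark configuration entry the new job cluster needs to reach Blob Storage.
name = f"fs.azure.account.key.{storage_account}.blob.core.windows.net"
value = access_key

print(name)  # fs.azure.account.key.<storageAccountName>.blob.core.windows.net
```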

FIGURE 6.17 Linked service configuration for the Azure Databricks batch job

  6. Click the Commit button ➢ select the Test Connection link on the Azure Databricks tab ➢ select the Settings tab ➢ click the Browse button to the right of the Notebook Path text box ➢ select Users ➢ select your account ID ➢ select the notebook you created in step 2 (for example, IdentifyBrainwaveScenario) ➢ click OK ➢ click the Commit button for the pipeline (for example, TransformSessionFrequencyToMedian) ➢ click Validate ➢ click Publish ➢ and then click Debug. Figure 6.18 illustrates how the TransformSessionFrequencyToMedian pipeline is now configured.

FIGURE 6.18 Azure Databricks batch job pipeline configuration
