Develop a Batch Processing Solution Using Azure Databricks – Create and Manage Batch Processing and Pipelines-2

The configuration of this pipeline grows in complexity as you progress through these lessons, so this is a good point to create a visual diagram of it. Imagine that you need to maintain or change an element in the pipeline: an illustration helps you identify all the dependencies and locate where the change should take place. The first action in Exercise 6.4 is to create a notebook. Notice that in this case you choose an interactive cluster, because you are about to develop and test some Python code. Later, when you configure the Notebook activity in Azure Synapse Analytics, you will instruct the scheduler to provision an automated job cluster instead. Then you add the first portion of the code into a cell. As the illustration in Figure 6.18 implies, the AVRO files are retrieved from an Azure Blob storage container instead of the ADLS container; this will remain the case until you learn more about security, managed identities, and Azure Key Vault in Chapter 8. Notice that the protocol used to access the files is wasbs, which is most commonly used to access a blob container, instead of abfss, the common protocol for ADLS containers. The readPath variable uses a wildcard character to retrieve all the AVRO files from that container and load them into a DataFrame.
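
A minimal sketch of that load follows; the container, storage account, and folder structure shown here are placeholder assumptions, not the exact values used in the exercise.

# Hypothetical container, account, and path; substitute your own values.
endpoint = "wasbs://<container>@<account>.blob.core.windows.net/"
readPath = endpoint + "EMEA/brainjammer/in/2022/*/*/*/*.avro"

# The wildcard pattern loads every matching AVRO file into a single DataFrame.
# (On Databricks the built-in spark session and AVRO reader are available.)
df = spark.read.format("avro").load(readPath)
df.printSchema()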

The following code uses the results of the EDA found in Table 5.2 to validate the median calculations made on all ClassicalMusic sessions per frequency. The expectation is that the readings in the AVRO data files for each brain wave frequency fall within the ranges provided in the code. The next line of code adds the scenario and a timestamp to the DataFrame.
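
The sketch below shows one way such a validation might look; the frequency names, range boundaries, and column names are illustrative assumptions rather than the actual values from Table 5.2.

from pyspark.sql.functions import col, lit, current_timestamp

# Hypothetical median ranges per frequency; the real bounds come from the EDA in Table 5.2.
ranges = {"ALPHA": (2.0, 6.0), "BETA_H": (1.0, 3.0), "THETA": (3.0, 9.0)}

# Keep only readings whose median falls inside the expected range for its frequency.
condition = None
for frequency, (low, high) in ranges.items():
    clause = (col("frequency") == frequency) & col("median").between(low, high)
    condition = clause if condition is None else (condition | clause)

df_validated = df.where(condition)

# Add the scenario name and a load timestamp, as described above.
df_validated = (df_validated
    .withColumn("SCENARIO", lit("ClassicalMusic"))
    .withColumn("TIMESTAMP", current_timestamp()))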

The data is then written to a delta lake table and selected. The output illustrates that four of the 20 ClassicalMusic sessions have brainjammer brain wave readings that fall into the expected range.
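
A hedged sketch of the write and the follow-up query appears next; the table name is an assumption for illustration.

# Write the validated readings to a Delta table (table name is illustrative).
(df_validated.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("classical_music_medians"))

# Query the table to inspect which sessions fell within the expected ranges.
spark.sql("SELECT * FROM classical_music_medians").show()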

When you return your focus to the TransformSessionFrequencyToMedian pipeline in Azure Synapse Analytics and the configured Azure Databricks linked service, you might have noticed something. One of the configuration requirements is the selection of the New Job Cluster radio button, as shown in the upper middle of Figure 6.17. This is the setting that instructs the scheduler to provision an automated job cluster. Another option provisions an existing interactive cluster, which you know isn't the correct one in this scenario because you are running a batch process. The remaining option provisions nodes that you have configured in a pool via Azure Databricks.

When you configured a compute cluster in Azure Databricks, you often needed to modify the Spark configuration. That is where you set the blob storage endpoint and access key so that connectivity to the blob storage account is authenticated successfully. The same configuration is required on a job cluster, which you provide via the Cluster Spark Conf section of the Azure Databricks linked service configuration, as shown in the lower right of Figure 6.17. The setting uses a key of the following form:
fs.azure.account.key..blob.core.windows.net
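
If you want to test the same connectivity interactively from the notebook first, an equivalent setting can be applied in a cell; the account name and key below are placeholders, not real values.

# Placeholder values; never hard-code a real access key in a notebook.
storage_account_name = "<storageAccountName>"
storage_account_key = "<storageAccountAccessKey>"

# Equivalent to the Cluster Spark Conf entry configured in the linked service.
spark.conf.set(
    f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net",
    storage_account_key)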

When you run the pipeline and it gets to the activity that executes the Azure Databricks Notebook batch job, you can see the provisioning, status, and usage of the job cluster in the Azure Databricks workspace. By selecting the Compute navigation menu item, then the Job Clusters tab, you see what is illustrated in Figure 6.19.

FIGURE 6.19 Azure Databricks batch job pipeline status
When you select the link in the Name column, you get access to configuration and event logs. To see the output and performance information, select the Workflows navigation menu item, then the Job Runs tab, and then the Job link.

Azure Data Factory
The batch processing capabilities of Azure Data Factory are almost identical to those of Azure Synapse Analytics. Regardless, you might get some questions on the exam about this tool, and you may find yourself in a position where you need to know about it; therefore, complete Exercise 6.5, which is the same as Exercise 6.2 but has been updated to target Azure Data Factory. As in Exercise 6.2, you need to have completed Exercise 6.1, where you provisioned an Azure Batch account, before starting Exercise 6.5.
