- Log in to the Azure portal at https://portal.azure.com ➢ navigate to the Azure Synapse Analytics workspace you created in Exercise 3.3 ➢ select the Open link in the Open Synapse Studio tile on the Overview blade ➢ select the Manage hub ➢ select Linked Services from the menu list ➢ click the + New button ➢ search for Batch ➢ select Azure Batch ➢ and then click Continue.
- Enter a name (I used BrainjammerAzureBatch) ➢ enable interactive authoring ➢ enter the Azure Batch access key (located in the Primary Access Key text box on the Keys blade of the Azure Batch account created in Exercise 6.1) ➢ enter the account name (I used brainjammer) ➢ enter the batch URL (also on the Keys blade, labeled Account Endpoint) ➢ enter the pool name you created in Exercise 6.1 (I used brainwaves) ➢ select + New from the Storage Linked Service Name drop‐down ➢ create a linked service to the storage account onto which you placed the batch code (brainjammer‐batch.exe) in step 3 of Exercise 6.1 ➢ and then click Test Connection. The configuration should resemble Figure 6.7; a sketch for verifying these values programmatically follows the figure.

FIGURE 6.7 Azure Batch linked service configuration
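If you want to double-check the values you entered into the linked service outside of Synapse Studio, the following minimal sketch uses the Microsoft.Azure.Batch NuGet package to open a client with the same account name, access key, and batch URL, and to confirm the pool exists. The region placeholder and key are assumptions; substitute the values from your Keys blade.

```csharp
using System;
using Microsoft.Azure.Batch;
using Microsoft.Azure.Batch.Auth;

// Values mirror the linked service fields; replace <region> and the key
// with the values from your Keys blade.
var credentials = new BatchSharedKeyCredentials(
    "https://brainjammer.<region>.batch.azure.com",  // batch URL (Account Endpoint)
    "brainjammer",                                   // account name
    "<primary-access-key>");                         // Primary Access Key

using BatchClient client = BatchClient.Open(credentials);

// GetPool throws if the pool does not exist, which catches typos early.
CloudPool pool = client.PoolOperations.GetPool("brainwaves");
Console.WriteLine($"Pool {pool.Id} is {pool.State}");
```

If Test Connection fails in the portal, running this kind of check narrows the problem down to a bad key, a wrong endpoint, or a missing pool.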
- Click the Commit button ➢ select the Publish menu item ➢ navigate to the Integrate hub ➢ create a new pipeline ➢ rename the pipeline (I used TransformSessionFrequencyToMedian) ➢ expand the Batch Service group in the Activities pane ➢ drag the Custom activity to the editor canvas ➢ rename the activity (I used Calculate Frequency Median) ➢ select the Azure Batch tab ➢ select the Azure Batch linked service you created in step 2 from the drop‐down (for example, BrainjammerAzureBatch) ➢ enable interactive authoring ➢ select the Settings tab ➢ enter run.bat in the Command text box ➢ select the Azure Storage linked service you created in step 2 from the Resource Linked Service drop‐down ➢ click the Browse Storage button next to the Folder Path text box ➢ navigate to the Exercise6.1 directory, which contains the run.bat file that you uploaded in Exercise 6.1 ➢ click OK ➢ and then click Commit. The configuration should resemble Figure 6.8, and a sketch of the roughly equivalent Batch SDK calls follows the figure.

FIGURE 6.8 Azure Batch Custom pipeline activity
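Behind the scenes, the Custom activity does roughly what this hedged sketch does with the Batch .NET SDK: it creates a job on the pool (named adfv2-&lt;pool name&gt;, as you will see in the next step), stages the files from the configured folder path onto a node, and adds a task that runs the configured command. The task ID here is hypothetical; the pipeline generates its own.

```csharp
using Microsoft.Azure.Batch;
using Microsoft.Azure.Batch.Auth;

var credentials = new BatchSharedKeyCredentials(
    "https://brainjammer.<region>.batch.azure.com", "brainjammer", "<primary-access-key>");
using BatchClient client = BatchClient.Open(credentials);

// Create a job on the brainwaves pool; the pipeline names its job
// adfv2-<pool name>, hence adfv2-brainwaves in the next step.
CloudJob job = client.JobOperations.CreateJob(
    "adfv2-brainwaves", new PoolInformation { PoolId = "brainwaves" });
job.Commit();

// Batch command lines do not run under a shell, so cmd /c is a safe way to
// invoke a batch file on a Windows node; in the pipeline you only entered
// run.bat into the Command text box.
var task = new CloudTask("calculate-frequency-median", "cmd /c run.bat");
client.JobOperations.AddTask("adfv2-brainwaves", task);
```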
- Click the Validate button ➢ click the Debug button to test the batch job ➢ exit the Azure Synapse Analytics workspace and navigate to the Azure Batch Overview blade ➢ select Jobs from the navigation menu ➢ select the Job ID link (for example, adfv2‐brainwaves) ➢ and then click the task, whose ID will resemble a GUID. After the batch job completes, you will see something like Figure 6.9; a snippet that lists the same task details via the SDK follows the figure.

FIGURE 6.9 Azure Batch task details
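The same task details can be pulled programmatically instead of clicking through the portal. This sketch, using the same assumed credentials as the earlier snippets, lists the tasks in the adfv2-brainwaves job along with their state and exit code.

```csharp
using System;
using Microsoft.Azure.Batch;
using Microsoft.Azure.Batch.Auth;

var credentials = new BatchSharedKeyCredentials(
    "https://brainjammer.<region>.batch.azure.com", "brainjammer", "<primary-access-key>");
using BatchClient client = BatchClient.Open(credentials);

// List every task in the job the pipeline created and report its status.
foreach (CloudTask task in client.JobOperations.ListTasks("adfv2-brainwaves"))
{
    Console.WriteLine(
        $"{task.Id}: {task.State}, exit code {task.ExecutionInformation?.ExitCode}");
}
```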
- Navigate to your ADLS container. New files are written to the path you provided for outputLocation in Exercise 6.1, appended with the current year, month, day, and hour, similar to Figure 6.10; a sketch of how such a path might be built follows the figure.

FIGURE 6.10 Azure Batch task output
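The exact path format is determined by the brainjammer-batch.exe code; the fragment below is only a plausible reconstruction of how a date-stamped output path like the one in Figure 6.10 could be built, and the outputLocation value is hypothetical.

```csharp
using System;

// outputLocation is a hypothetical value; use whatever you passed in
// Exercise 6.1. The format string is a guess at the convention, not the
// actual brainjammer-batch.exe implementation.
string outputLocation = "EMEA/brainjammer/out";
DateTime now = DateTime.UtcNow;
string path = $"{outputLocation}/{now:yyyy}/{now:MM}/{now:dd}/{now:HH}";
Console.WriteLine(path);  // e.g., EMEA/brainjammer/out/2024/05/17/14
```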
- Navigate back to the Azure Synapse Analytics workspace, and then select the Publish menu item. Figure 6.11 illustrates the configuration.

FIGURE 6.11 Azure Batch—Azure Synapse Analytics batch service pipeline
Note that all of the data retrieved and stored by the products configured so far uses the same ADLS container. This isn't always the case, but it makes managing the solution architecture easier. There are two points to watch out for. First, as much as possible, make sure the storage account is in the same region as the resources that use the data. This might not always be possible, but be aware that you are charged for bandwidth (egress) when data moves between Azure regions. Second, there are storage limitations, as shown in Table 6.2. These limits are large, but if you need more, you can contact Microsoft support, which will often raise a limit given justification. These limits exist largely to protect the customer from a runaway rogue process that consumes resources indefinitely, at least up until the bill comes at the end of the month.
TABLE 6.2 Azure Storage limits
| Resource | Limit per storage account |
|---|---|
| Maximum storage capacity | 5 PB |
| Maximum requests per second | 20,000 |
Recall Table 6.1, which summarizes the filesystem and directory components; Figure 6.9 shows an illustration of both. Notice the root/ directory next to the Location label, which indicates where the focus is set on the filesystem. Also notice the wd directory, which represents the working directory. A task has read, write, update, create, and delete permissions on the wd directory, and its contents are removed once the task's retention time elapses. That interval is controlled by a task property named RetentionTime, which defaults to 7 days and can be updated on a task‐by‐task basis, meaning that each task can have its own retention setting. Finally, two log files are written to the root/ directory: stderr.txt and stdout.txt. If the code throws an exception, the exception message and some details are logged to stderr.txt, so if you are debugging a task and nothing seems to be happening, look in this file. The stdout.txt file is where application output is captured. For example, notice in the Program.cs file, which performs the analysis on the brainjammer brain waves, that there are strategically placed WriteLine() methods containing text details. Had you used Python instead of C#, output from print(), or from logging handlers directed at standard output, would land in this file in the same way.
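To make the log-file behavior concrete, here is a hedged C# fragment showing which stream feeds which file, plus one way to set a per-task retention time with the Batch .NET SDK; the task ID and command are hypothetical.

```csharp
using System;
using Microsoft.Azure.Batch;

// Standard output is captured in stdout.txt; standard error, including
// unhandled exception details, is captured in stderr.txt.
Console.WriteLine("Calculating frequency median...");    // -> stdout.txt
Console.Error.WriteLine("Example of an error message");  // -> stderr.txt

// Override the retention time for a single task so its files (including
// the wd directory) are kept for one day instead of the default.
// TaskConstraints parameters: (maxWallClockTime, retentionTime, maxTaskRetryCount)
var task = new CloudTask("hypothetical-task-id", "cmd /c run.bat")
{
    Constraints = new TaskConstraints(null, TimeSpan.FromDays(1), null)
};
```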