Advanced Data Pipeline Concepts – Create and Manage Batch Processing and Pipelines

Whether you call these advanced or additional concepts, the fact remains that they are some very powerful features available within an Azure Synapse Analytics pipeline. If you need to perform an action that is not available as a built-in option, or if you are looking for ways to improve the overall pipeline run, the following sections discuss a few examples that might help.

Parameterization and Variables

If you look back at Exercise 6.1, you might recall the hard-coding of some of the arguments stored in the run.bat file. When you think about batch processing, one of the benefits is hands-off automation. That means once you have completed the configuration and deployment of a batch job, you schedule it and let it run without any intervention. However, the values placed into the run.bat file do not deliver such an experience. One approach is to replace the arguments in the batch file with parameters. After selecting the pipeline you created in Exercise 6.2 (TransformSessionFrequencyToMedian), select the Parameters tab. Adding the arguments as parameters might resemble what is shown in Figure 6.23.

FIGURE 6.23 Azure Synapse Analytics parameters
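
Behind the designer, parameters added on the Parameters tab are stored in the pipeline's JSON definition. The following is a minimal sketch, assuming the two parameter names used later in this section and placeholder default values:

"parameters": {
    "storageAccountContainerName": { "type": "string", "defaultValue": "<container>" },
    "storageAccountName": { "type": "string", "defaultValue": "<account>" }
}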

Click the Custom activity (Calculate Frequency Median) that executes the run.bat file, and then click the Settings tab. It is then possible to set the arguments within the Command multiline text box with the following syntax, as shown in Figure 6.24:
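
Assuming the parameters are named storageAccountContainerName and storageAccountName, as they are later in this section, the command might resemble the following. Note that an expression embedded within a literal string must use the @{...} string interpolation syntax:

cmd /c run.bat @{pipeline().parameters.storageAccountContainerName} @{pipeline().parameters.storageAccountName}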

FIGURE 6.24 Azure Synapse Analytics parameters as command arguments

When the batch job is complete and the pipeline has been run, you can hover over the run's Status line on the Output tab and click either the Input or Output link to view the parameters, as shown in Figure 6.25.

FIGURE 6.25 Azure Synapse Analytics parameter input and output run details

You will also see a tab in Figure 6.23 called Variables. The difference between a parameter and a variable is that a parameter is not expected to change throughout the execution of the pipeline run, whereas a variable can change between the completion of activities. In the Spark Job Definition activity from Exercise 6.2, you might recall that the container and account the AVRO files are written to were hard-coded as <container>@<account>. Hard-coding values is rarely a good idea. You can avoid this by adding a pipeline variable named, for example, containerAccount. You can then reference that variable from the Spark job definition settings by using the following syntax:

@variables('containerAccount')
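
Like parameters, a variable added on the Variables tab appears in the pipeline's JSON definition. A minimal sketch of the containerAccount declaration might look like this:

"variables": {
    "containerAccount": { "type": "String", "defaultValue": "" }
}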

All that you read here will be performed in Exercise 6.6. Consider this content as a preface to that exercise.

The containerAccount variable can be set by using the concat() function to combine a parameter named storageAccountContainerName with a parameter named storageAccountName, with an @ (at) sign between them, for example:

@concat(pipeline().parameters.storageAccountContainerName, '@',
        pipeline().parameters.storageAccountName)
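
One way to perform that assignment is with a Set Variable activity that runs before the Spark job definition activity. A minimal sketch of how such an activity might appear in the pipeline JSON (the activity name is illustrative):

{
    "name": "Set containerAccount",
    "type": "SetVariable",
    "typeProperties": {
        "variableName": "containerAccount",
        "value": "@concat(pipeline().parameters.storageAccountContainerName, '@', pipeline().parameters.storageAccountName)"
    }
}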

Another valuable use case for parameters has to do with passing values from one activity to another. This is accomplished by, for example, configuring an argument named itemName in a Get Metadata activity. Once that activity has run, the itemName value can be referenced from a dependent activity that runs after the Get Metadata activity by using the following syntax:

@activity('Get Data').output.itemName
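
For context, the following is a sketch of how such a Get Metadata activity, named Get Data to match the reference above, might appear in the pipeline JSON; the dataset reference is a placeholder:

{
    "name": "Get Data",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": { "referenceName": "<datasetName>", "type": "DatasetReference" },
        "fieldList": [ "itemName" ]
    }
}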

Figure 6.26 illustrates this in more detail so that you can visualize how to add dynamic content to an argument or value.

FIGURE 6.26 Azure Synapse Analytics passing parameter between pipeline activities

When you set focus on a value or argument text box that supports dynamic content, the Add Dynamic Content (Alt+Shift+D) link is rendered. Clicking the link or pressing the keyboard combination opens an expression panel where you can add the expression that delivers your expected value dynamically.
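
For example, a hypothetical expression entered into that panel to build a date-based folder path might resemble the following, using the built-in utcnow() and formatDateTime() functions:

@concat('output/', formatDateTime(utcnow(), 'yyyy/MM/dd'))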
