Azure Synapse Pipelines – Create and Manage Batch Processing and Pipelines

At this point you are no beginner when it comes to pipelines; you have configured quite a few in previous exercises. You have not, however, used the Batch Service pipeline activity in Azure Synapse Analytics. The Batch Service activity uses an Azure Batch account to run batch processing workloads.

Azure Batch

Azure Batch provides customers with compute resources, commonly referred to as nodes, for executing large‐scale software‐based workloads. These workloads, or tasks, are custom‐coded programs that perform custom actions. In a high‐performance computing (HPC) environment, these programs can benefit from massive scaling that enables parallel processing. An HPC environment offers an extraordinary amount of CPU or memory and is commonly used for 3D imaging, financial simulations, or Big Data processing, for example. If a batch solution has tasks that can run independently from each other, Azure Batch can be configured to run those tasks in parallel using resources from a pool of nodes, also known as a node pool. Figure 6.5 represents a possible Azure Batch workload.

FIGURE 6.5 An Azure Batch workflow

As illustrated on the right of Figure 6.5, files are generated by some data producer and ingested by numerous means into an ADLS container. A job is scheduled to run at predefined intervals or on demand and contains one or more tasks. Those tasks download data from the data source and begin performing their action on the data. If the tasks are unrelated and can be run in parallel, then multiple tasks are triggered simultaneously for execution. Executing code requires a compute resource, in this case an Azure virtual machine. The VM is allocated from the pool, the code is executed, and then the output is uploaded to a datastore and made available to consumers. In this case, the consumers can be HDInsight, Azure Synapse Analytics, or Azure Databricks, or the data might be ready for reporting and rendering in Power BI. Table 6.1 provides a brief summary of the Azure Batch components.
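The download, process, and upload flow described above can be sketched locally in Python. This is only an analogy and does not call the Azure Batch service: the in-memory dictionaries stand in for the ADLS container and the output datastore, a thread pool stands in for the node pool, and the file names, run_task, and run_job are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins: a dict for the input container and a dict for the output datastore.
input_store = {
    "file-01.csv": "3,1,2",
    "file-02.csv": "10,20",
    "file-03.csv": "7",
}
output_store = {}

def run_task(file_name):
    """One task: download its input, perform its action, upload the result."""
    raw = input_store[file_name]                 # download from the data source
    total = sum(int(v) for v in raw.split(","))  # the task's custom action
    output_store[file_name] = total              # upload to the datastore
    return file_name, total

def run_job(file_names, pool_size):
    """One job: one task per file, with up to pool_size tasks running at once."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # The tasks are independent, so they can safely run in parallel.
        return dict(pool.map(run_task, file_names))

results = run_job(sorted(input_store), pool_size=2)
print(results)
```

Because each task touches only its own input and output entry, the number of workers can change without changing the result, which is the property that makes a workload a good fit for parallel execution on a pool.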

TABLE 6.1 Azure Batch resource components

Component      Description
Batch account  A batch account is a unique entity that contains all Azure Batch resources and compute.
Node           A node is a dedicated Azure VM for processing a segment of the workload.
Pool           A pool is a collection of nodes that provides the compute resources for executing jobs.
Job            A job is a container for tasks, which can number into the millions.
Task           A task is an individual unit of work.
Filesystem     Each node is allocated a temporary storage drive dedicated to the task.
Directory      The root directory is available for tasks that require system access.
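One way to picture how these components relate is a small Python data model. It is a sketch under stated assumptions, not the Azure Batch SDK: the class and field names are hypothetical, and the round‐robin assignment is only an illustration of spreading tasks across nodes, not a description of how the service actually schedules them.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Task:
    command: str  # an individual unit of work

@dataclass
class Job:
    job_id: str
    pool_id: str  # the pool whose nodes run this job's tasks
    tasks: List[Task] = field(default_factory=list)

@dataclass
class Node:
    node_id: str
    temp_dir: str  # hypothetical path standing in for the node's temporary storage

@dataclass
class Pool:
    pool_id: str
    nodes: List[Node] = field(default_factory=list)

@dataclass
class BatchAccount:
    name: str
    pools: Dict[str, Pool] = field(default_factory=dict)
    jobs: Dict[str, Job] = field(default_factory=dict)

def assign_round_robin(account, job_id):
    """Illustrative only: spread a job's tasks evenly over the nodes of its pool."""
    job = account.jobs[job_id]
    nodes = account.pools[job.pool_id].nodes
    plan = {node.node_id: [] for node in nodes}
    for i, task in enumerate(job.tasks):
        plan[nodes[i % len(nodes)].node_id].append(task.command)
    return plan

account = BatchAccount(name="example-account")
account.pools["pool-a"] = Pool("pool-a", [Node(f"node-{i}", f"/tmp/node-{i}") for i in range(2)])
account.jobs["job-1"] = Job("job-1", "pool-a", [Task(f"process part-{i}") for i in range(3)])

plan = assign_round_robin(account, "job-1")
print(plan)
```

The model mirrors the table: the account owns pools and jobs, a pool owns nodes, and a job owns tasks that run on the nodes of the pool it points to.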

To learn more about these components, complete Exercise 6.1, where you will provision an Azure Batch account and configure it.

Ileana Pecos
