Parallelism – Create and Manage Batch Processing and Pipelines

The opposite of parallel is serial. If you run activities in a pipeline serially, the time required to complete the pipeline run is the sum of the times the individual activities take. For example, if you have five activities that each take 30 seconds to complete and they run serially, the pipeline run takes two and a half minutes. However, if you run those same activities in parallel, the pipeline run completes in about 30 seconds. The caveat to consider when running activities in parallel has to do with dependencies: an activity that depends on another cannot start until that activity has finished. Figure 6.27 represents a pipeline that contains activities with no dependencies between them.
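The arithmetic behind that example can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the calculation ignores any startup or orchestration overhead a real pipeline run would add.

```python
# Durations in seconds for the five activities in the example above.
durations = [30, 30, 30, 30, 30]

def serial_runtime(durations):
    """Activities run one after another, so the run time is the sum."""
    return sum(durations)

def parallel_runtime(durations):
    """Independent activities start together, so the slowest one sets the run time."""
    return max(durations, default=0)

print(serial_runtime(durations))    # 150 seconds, i.e. two and a half minutes
print(parallel_runtime(durations))  # 30 seconds
```

If the activities had different durations, the parallel run time would be that of the longest activity, which is why a single slow activity can dominate an otherwise fast pipeline.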

Notice that no lines connect the five activities. The absence of connecting lines signifies that there is no dependency between the activities. When you trigger the pipeline, all activities run in parallel and the run completes faster than if they had been run serially. Up to this point, the dependencies you have created between activities have been based on the successful completion of the preceding activity. However, there are other conditions: failed, skipped, and completed. You can add a dependency with a different condition by selecting the Add Activity On button located on the activity, as shown in Figure 6.28.
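In the pipeline JSON, a dependency of this kind is typically recorded in a dependsOn list on the downstream activity, with the condition given as a status string. The following is a minimal sketch of that shape built in Python. The activity names and the depends_on helper are hypothetical, and the fragment leaves out every field not related to dependencies.

```python
import json

# The four dependency conditions named in the text, spelled as status strings.
ALLOWED_CONDITIONS = {"Succeeded", "Failed", "Skipped", "Completed"}

def depends_on(upstream_activity, *conditions):
    """Build one dependsOn entry for an activity definition."""
    unknown = set(conditions) - ALLOWED_CONDITIONS
    if unknown:
        raise ValueError(f"unknown dependency condition(s): {sorted(unknown)}")
    return {"activity": upstream_activity, "dependencyConditions": list(conditions)}

# Hypothetical fragment: only the fields relevant to dependencies are shown.
stored_procedure_activity = {
    "name": "RunStoredProcedure",
    "dependsOn": [depends_on("DeleteFiles", "Succeeded")],
}

print(json.dumps(stored_procedure_activity, indent=2))
```

An activity with an empty dependsOn list has no connecting line in the designer and is free to start as soon as the pipeline is triggered.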

FIGURE 6.27 Azure Synapse Analytics pipeline activity with no dependencies

FIGURE 6.28 Azure Synapse Analytics pipeline activity with dependencies

Figure 6.28 now shows many dependencies. For example, the Stored Procedure activity will not run until the Delete activity has completed successfully. If the Stored Procedure activity is skipped, the Set Variable activity runs and the pipeline run terminates. If the Stored Procedure activity succeeds, the Notebook activity does not run, the Copy Data activity is executed, and the pipeline run completes. Because Skipped is also an option under Add Activity On, this can be a bit hard to follow; there is actually no Skipped dependency configured on the Notebook activity, it simply does not run on this path. If, however, the stored procedure fails, the Notebook activity is triggered, and when it runs successfully, the Copy Data activity is executed and the pipeline ends. You can see how complicated dependencies can become in large pipelines. The same is true of batch processing solutions that have many dependencies. Highly skilled technical people are in such demand because they can manage these complexities and still produce results.
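One way to reason about paths like these is to check, for each activity, whether the final status of its upstream activity matches the condition on the connecting line. The sketch below is a simplified model rather than the service's actual scheduler. It assumes each dependency is a dictionary with an "activity" name and a "dependencyConditions" list, that Completed means the upstream activity either succeeded or failed, and that an activity runs only when all of its dependencies are satisfied. The function names are hypothetical.

```python
def condition_met(upstream_status, conditions):
    """True if the upstream activity's final status satisfies any listed condition."""
    for condition in conditions:
        # Completed is satisfied by either outcome of an activity that actually ran.
        if condition == "Completed" and upstream_status in ("Succeeded", "Failed"):
            return True
        if condition == upstream_status:
            return True
    return False

def can_run(dependencies, statuses):
    """An activity may run only when every one of its dependencies is satisfied."""
    return all(
        condition_met(statuses[dep["activity"]], dep["dependencyConditions"])
        for dep in dependencies
    )

# The failure path described above: Delete succeeded, Stored Procedure failed.
statuses = {"Delete": "Succeeded", "Stored Procedure": "Failed"}

notebook_deps = [{"activity": "Stored Procedure", "dependencyConditions": ["Failed"]}]
set_variable_deps = [{"activity": "Stored Procedure", "dependencyConditions": ["Skipped"]}]

print(can_run(notebook_deps, statuses))      # True: the Notebook activity is triggered
print(can_run(set_variable_deps, statuses))  # False: Set Variable waits for a skip
```

Tracing each branch this way, one status at a time, makes it easier to confirm that every path through a large pipeline ends where you expect.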

The pipeline JSON configuration file for TransformSessionFrequencyToMedian is in the Chapter06 directory on GitHub. To conclude this section, complete Exercise 6.6, where you will implement the advanced pipeline concepts you just read about.

Ileana Pecos
