Design and Develop a Batch Processing Solution – Create and Manage Batch Processing and Pipelines-2

Scaling is a means of managing latency: for example, adding more CPUs and memory, or partitioning the data so that each batch job processes only a specific piece of it. This way, you might have six batch jobs running in parallel, each with instructions to process a different dataset from the same data source. Remember that the more quickly your jobs run and complete their transformations, the sooner your business can gain insights from the data. Some jobs may be time sensitive; therefore, monitoring latency and improving performance is an important aspect here, as it is in most areas of IT. Finally, note that when you run your batch solutions on the Azure platform, Microsoft manages most of these compute resource scenarios so that you can focus on your project details instead. However, it is good to know and to have been exposed to these points in the event you need to design or manage an on‐premises or hybrid batch solution.
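To make the parallelism idea concrete, here is a minimal sketch of six batch jobs processing different partitions of the same data source side by side. The partition paths and the process_partition transformation are hypothetical placeholders, and Python's standard concurrent.futures module stands in for whatever scheduler actually runs your jobs.

```python
# A minimal sketch: six batch jobs run in parallel, each over its own
# slice of the source data. Paths and the transformation logic are
# hypothetical placeholders, not from the book.
from concurrent.futures import ProcessPoolExecutor

# Six partitions of the same data source (assumed file layout).
PARTITIONS = [f"/data/source/partition_{i}.csv" for i in range(6)]

def process_partition(path: str) -> str:
    # Placeholder transformation: a real job might convert the file,
    # aggregate it, or load it into a data store.
    with open(path, encoding="utf-8") as f:
        row_count = sum(1 for _ in f)
    return f"{path}: processed {row_count} rows"

if __name__ == "__main__":
    # Each partition gets its own worker process, so the six jobs run
    # concurrently instead of one after another, reducing total latency.
    with ProcessPoolExecutor(max_workers=6) as executor:
        for result in executor.map(process_partition, PARTITIONS):
            print(result)
```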

The second factor has to do with managing dependencies. Figure 6.2 shows three scenarios: one‐to‐one, one‐to‐many, and a range.

Batch solutions can be very complex, not only because of the actions they perform but also because of the interrelated dependencies they can have with other batch jobs. It is not uncommon for many hundreds of batch jobs to support a large enterprise organization, and many of those jobs depend on the output of another batch job that has already completed successfully. For example, consider a batch job that converts a CSV file to Parquet so that a downstream job that requires data in that format can consume it. If for some reason the file cannot be converted or an error occurs during the process, then the next batch job in the sequence should not proceed. You can manage this kind of scenario by creating relationships between batch jobs. In a one‐to‐one relationship, as shown in Figure 6.2, if the batch job named batchA does not complete successfully, batchB will not be scheduled for execution. In the one‐to‐many scenario, batchC can be executed only if both batchA and batchB complete as expected. Finally, in the range scenario, a set of batch jobs must complete before batchD is triggered; here, 10 batch jobs must finish successfully before processing proceeds to batchD. You can also create many‐to‐many relationships between batch jobs, as shown in Figure 6.3.

FIGURE 6.2 Azure Batch processing—dependencies
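The three scenarios in Figure 6.2 map directly onto the task dependency feature of the Azure Batch service. The following is a minimal sketch using the classic azure-batch Python SDK; the pool ID, job ID, and command lines are placeholder assumptions, and constructing the authenticated batch_client is omitted.

```python
# A minimal sketch of the Figure 6.2 dependency scenarios with the
# azure-batch SDK. IDs and command lines are hypothetical; credential
# and client setup are omitted.
from azure.batch.models import (
    JobAddParameter, PoolInformation, TaskAddParameter,
    TaskDependencies, TaskIdRange,
)

def add_dependent_tasks(batch_client, job_id="dependency-demo"):
    # The job must opt in to task dependencies when it is created.
    batch_client.job.add(JobAddParameter(
        id=job_id,
        pool_info=PoolInformation(pool_id="my-pool"),  # assumed pool
        uses_task_dependencies=True,
    ))

    # One-to-one: batchB is scheduled only if batchA succeeds.
    batch_client.task.add(job_id, TaskAddParameter(
        id="batchA", command_line="cmd /c csv-to-parquet"))
    batch_client.task.add(job_id, TaskAddParameter(
        id="batchB", command_line="cmd /c consume-parquet",
        depends_on=TaskDependencies(task_ids=["batchA"])))

    # One-to-many: batchC waits for both batchA and batchB.
    batch_client.task.add(job_id, TaskAddParameter(
        id="batchC", command_line="cmd /c aggregate",
        depends_on=TaskDependencies(task_ids=["batchA", "batchB"])))

    # Range: batchD waits for the set of 10 tasks with IDs 1-10
    # (range dependencies require integer task IDs).
    for i in range(1, 11):
        batch_client.task.add(job_id, TaskAddParameter(
            id=str(i), command_line="cmd /c partial-load"))
    batch_client.task.add(job_id, TaskAddParameter(
        id="batchD", command_line="cmd /c finalize",
        depends_on=TaskDependencies(
            task_id_ranges=[TaskIdRange(start=1, end=10)])))
```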

FIGURE 6.3 Azure Batch processing—many‐to‐many dependency

This scenario requires that both batchA and batchB, each of which constitutes many batch jobs, complete successfully before continuing. Only if both batchA and batchB succeed are the dependent downstream batch jobs executed. A useful tactic to recognize here is that the unrelated, independent downstream batch jobs can run in parallel, which can decrease the overall duration of the batch solution; the sketch at the end of this section illustrates the pattern.

The last concept to discuss has to do with user interactions with batch job processing. Many of the problems that occur in any IT solution are caused by human error. Manually executing a SQL query that drops a database or table, or wrongly updating a large amount of data, can have a significant impact on your business. Recovering from such an event depends largely on the precautions you or your company have taken in advance. The best way to avoid such a scenario is to use automation to reduce the frequency with which humans interact with your data. Performing as much as possible through automation, very often by using batch job solutions, achieves that objective. If you know your batch jobs are performing as expected, there is rarely a need to intervene. The lack of intervention can also improve data quality: the same code run on the same data will always produce the same output, whereas humans sometimes deviate. Now that you know some basics about batch jobs, read on to learn more about using batch jobs in the Big Data paradigm.
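As referenced earlier, here is how the Figure 6.3 many‐to‐many pattern might look, continuing the assumed job and client from the previous sketch. Each downstream task lists both upstream tasks in depends_on, and because the downstream tasks do not depend on one another, Azure Batch can execute them in parallel once both upstreams succeed. The task names and command lines are again hypothetical.

```python
# A minimal sketch of the Figure 6.3 many-to-many scenario, continuing
# the assumed "dependency-demo" job from the previous example.
from azure.batch.models import TaskAddParameter, TaskDependencies

def add_many_to_many_tasks(batch_client, job_id="dependency-demo"):
    for name in ("downstream1", "downstream2", "downstream3"):
        batch_client.task.add(job_id, TaskAddParameter(
            id=name,
            command_line="cmd /c run-downstream",  # hypothetical command
            # Released only after both upstream tasks succeed; the three
            # downstream tasks are independent, so they run concurrently.
            depends_on=TaskDependencies(task_ids=["batchA", "batchB"]),
        ))
```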
