Design and Develop a Batch Processing Solution – Create and Manage Batch Processing and Pipelines-1

From a general batch processing perspective, you need to consider the following concepts when designing a solution:

  • Using compute resources efficiently
  • I/O management
  • Scheduling and dependencies
  • User interaction

When thinking about the compute resources necessary for running your batch solution, you should consider two factors. The first factor is the code you or a developer will write to extract, transform, and load data from your data lake. In other words, the way a connection is made to the data source, the manner in which the data is retrieved, the logic written to transform the data, and the way the resulting dataset is stored all require attention. The most important point, and the one that will have the most impact on compute resources, is the code that transforms the data. A very common scenario for processing data is to retrieve a dataset and then program a loop that analyzes each row and performs an action on it. The following is a pseudocode example of such a loop:

do {
    Match(aLargeAmountOfText, "\"(([^\\\"]*)(\\.)?)*\"")
    response = requests.get("https://fqdn/api/brainjammer")
    dataset.MoveNext()
} while (dataset.EOF == false)

Some activities are known to cause latency and/or high CPU consumption. An example of an activity that consumes a lot of CPU is searching large strings for keywords. Loading that much data into memory can be slow, and, depending on how the text in memory is searched, the CPU can become overloaded and processing can grind to a halt. Specifically, running a poorly written regular expression is known to have such an effect on a CPU. An example of a regular expression that can drive CPU consumption to 100 percent is shown as the parameter passed to the Match() method. Don't use anything like this. Another cause of latency is I/O operations, which require thread context switching when reading from or writing to disk or when accessing a resource located on another computer. Depending on the type of local disk, reading and writing to a physical disk can be slow, so perform as much transformation as possible in memory to avoid unnecessary disk access. Additionally, as shown in the previous code example, if you call a REST API within a loop, the job is vulnerable to latency, not only because of the I/O context switch but also because of the current load on the server that hosts the REST API. In general, calling an API inside a loop is bad practice, but it does happen and sometimes may be your only option.
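The danger with the quoted-string pattern above is its nested quantifiers, which give the regex engine many ways to match the same characters and can trigger catastrophic backtracking. A minimal sketch in Python of a standard fix, the "unrolled loop" rewrite, which expresses the same match without ambiguity (the sample text and variable names are illustrative, not from the original):

```python
import re

# The problematic pattern: the nested quantifiers in (( ... )*)* let the
# engine partition the input in exponentially many ways, so a long string
# with no closing quote can pin the CPU at 100 percent (catastrophic
# backtracking). Shown for reference only; do not run it on large input.
bad_pattern = re.compile(r'"(([^\\"]*)(\\.)?)*"')

# The "unrolled loop" rewrite matches the same quoted strings, but each
# character can belong to exactly one part of the pattern, so the engine
# never backtracks exponentially.
safe_pattern = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

text = r'He said "hello \"world\"" and left.'
match = safe_pattern.search(text)
print(match.group(0))  # the full quoted substring, escapes included
```

The rewrite is behavior-preserving for well-formed input; the win is that a near-miss (for example, an unterminated quote in megabytes of text) fails fast instead of hanging the batch job.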

Some latency is expected when running batch jobs, which can sometimes take hours to complete. The amount of data being processed can be very large, and that in itself causes jobs to take longer. There are a few options for mitigating the resource consumption problems you might encounter when running batch jobs. A rather simple one is to schedule the job to run at a time when users or other applications are not competing for the same resources. For example, if the batch job needs to pull data from an OLTP data source and you know there is less activity on that data source after 23:00, then run the batch job after that time. This contrasts with attempting to extract data at 10:00 on a business day. Simply put, avoid prime business hours when scheduling batch jobs.
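A minimal sketch of such an off-peak guard in Python (the 23:00 to 06:00 window, the function name, and the wrap-around logic are assumptions for illustration; in practice the schedule would live in your orchestrator's trigger, not in the job itself):

```python
from datetime import datetime, time

# Assumed off-peak window: after 23:00 or before 06:00.
# Adjust to match the quiet hours of your OLTP source.
OFF_PEAK_START = time(23, 0)
OFF_PEAK_END = time(6, 0)

def is_off_peak(now: time) -> bool:
    """Return True when the batch window is open.

    The window wraps midnight, so the two bounds are combined with `or`
    rather than `and`.
    """
    return now >= OFF_PEAK_START or now < OFF_PEAK_END

if is_off_peak(datetime.now().time()):
    pass  # safe to run the extract against the OLTP source here
```

A scheduler-driven trigger (for example, a cron expression or a pipeline schedule) is usually preferable to an in-code check, but a guard like this is a cheap safety net when a job can be launched manually at any hour.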

Ileana Pecos
