To begin, examine Figure 6.4 to see where batching often resides in a Big Data architecture. Recognize that the diagrams and models presented here are examples and that your end solution can and likely will be different. In many cases, the architecture that you come up with may very well be unique to you and your company. You might also recognize that this resembles the lambda architecture model, as discussed in Chapter 3. There is also some resemblance to the serving layer model, as discussed in Chapter 4.
FIGURE 6.4 Batch processing in Big Data architecture
As Figure 6.4 suggests, batch processing is used in scenarios where the flow of data is noncontinuous, in contrast to the real‐time ingestion stream. An ingestion phase or step usually happens before the batching procedures are performed. The ingestion copies, moves, or collects data from numerous locations and stores it in a data lake hosted on Azure. This ingestion, as you have learned, provides the opportunity to store the data in a single location close to the compute resources you will use to perform data analytics. Additionally, ingestion is the phase where you logically structure and host the data in directory structures like the following:
{Region}/{SubjectMatter}/in/{yyyy}/{mm}/{dd}/{hh}/
{Region}/{SubjectMatter}/out/{yyyy}/{mm}/{dd}/{hh}/
EMEA/brainjammer/in/2022/06/10/14
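This path convention is straightforward to generate programmatically. The following is a minimal Python sketch; the `ingestion_path` helper is a hypothetical function for illustration, not part of any Azure SDK:

```python
from datetime import datetime, timezone

def ingestion_path(region: str, subject: str, direction: str,
                   when: datetime) -> str:
    """Build a data lake directory path following the
    {Region}/{SubjectMatter}/{in|out}/{yyyy}/{mm}/{dd}/{hh}/ convention."""
    return f"{region}/{subject}/{direction}/{when:%Y/%m/%d/%H}"

# Reproduce the EMEA brainjammer ingestion path shown above.
path = ingestion_path("EMEA", "brainjammer", "in",
                      datetime(2022, 6, 10, 14, tzinfo=timezone.utc))
print(path)  # EMEA/brainjammer/in/2022/06/10/14
```

Generating paths from a single helper like this keeps the convention consistent across all your ingestion pipelines, rather than hand-building strings in each one.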
Whether the data is contained in files or in a SQL pool inside Azure Synapse Analytics, the most logical place to store it is in an ADLS container. You may need to perform some kind of transformation prior to running a batch process on it. Having the data on the Azure platform in the recommended directory structure gives you the opportunity to do that. After the data is in your Azure data lake and in a format ready for processing, you can perform your batch processing. Ultimately, the data is processed and progressed through the numerous DLZs, as shown here—first into the out directory and then through the necessary transformations:
EMEA/brainjammer/out/2022/06/10/17
EMEA/brainjammer/bronze/2022/06/11/10
EMEA/brainjammer/silver/2022/06/11/14
EMEA/brainjammer/gold/2022/06/12/23
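The zone progression above can also be expressed as a small helper that computes the next DLZ path for a dataset as it is promoted. This is an illustrative sketch only; the `promote` function, the zone list, and the timestamp parameter are assumptions for this example, not part of any Azure tooling:

```python
from datetime import datetime

# Ordered data landing zones a dataset moves through during transformation.
ZONES = ["out", "bronze", "silver", "gold"]

def promote(path: str, processed_at: datetime) -> str:
    """Given a DLZ path like EMEA/brainjammer/out/2022/06/10/17, return the
    path for the next zone, stamped with the processing time."""
    region, subject, zone, *_ = path.split("/")
    next_zone = ZONES[ZONES.index(zone) + 1]  # raises IndexError at "gold"
    return f"{region}/{subject}/{next_zone}/{processed_at:%Y/%m/%d/%H}"

# Promote the out-directory data to the bronze zone.
bronze = promote("EMEA/brainjammer/out/2022/06/10/17",
                 datetime(2022, 6, 11, 10))
print(bronze)  # EMEA/brainjammer/bronze/2022/06/11/10
```

Each promotion would normally follow a transformation step (cleansing for bronze, conforming for silver, aggregating for gold); the helper only captures where the output lands.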
If you are working with relational data, you would use a series of temporary, staging, dimensional, and fact tables to perform the same progression through the Big Data pipeline phases. Batch processing is used to perform specific activities, typically in the transformation phase of the Big Data pipeline. Figure 6.4 illustrates this position, but there is another model that can show where batch processing is employed in data analytics solutions: the lambda architecture.