HDInsight – Create and Manage Batch Processing and Pipelines

If you were to run batch processing on HDInsight, you would likely choose either Hive or Pig to write and then execute the script. Hive exposes HiveQL, a SQL-like language that includes common SQL commands such as SELECT, INSERT, UPDATE, and DELETE. It also supports aggregate functions that can, for example, calculate the median of a given set of measurements, like the following:

hive> select percentile_approx(cast(AF3THETA as DOUBLE), 0.5) from brainwaves;
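Note that Hive's exact percentile function accepts only integral columns, which is why percentile_approx is used for DOUBLE values. To put the query in context, the following is a minimal HiveQL sketch; the table definition and the AF3ALPHA column are assumptions for illustration, modeled on the brainjammer readings used throughout this book:

-- hypothetical table with one DOUBLE column per brain wave channel
CREATE TABLE IF NOT EXISTS brainwaves (
  AF3THETA DOUBLE,
  AF3ALPHA DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- percentile_approx at 0.5 returns the approximate median of each channel
SELECT percentile_approx(AF3THETA, 0.5) AS median_theta,
       percentile_approx(AF3ALPHA, 0.5) AS median_alpha
FROM   brainwaves;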

Pig uses Pig Latin, a procedural data flow language that reads more like step-by-step programming. Consider the following brainjammer brain wave readings:

x = (ALPHA:3.9184, BETA_H:1.237, BETA_L:1.911, GAMMA:0.8926, THETA:15.7086)

You can execute Pig code like the following to calculate the median. Pig has no built-in MEDIAN function, so this assumes that a median UDF (for example, from the Apache DataFu library) has been registered and aliased as MEDIAN:

y = FOREACH x GENERATE MEDIAN(TOBAG(ALPHA, BETA_H, BETA_L, GAMMA, THETA));
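A more complete sketch follows, assuming the readings arrive as a comma-separated file named brainwaves.csv and that the Apache DataFu JAR is available on the cluster; both the file name and the JAR path are hypothetical:

-- register Apache DataFu, which supplies the median UDF that Pig lacks
REGISTER 'datafu-pig.jar';
DEFINE Median datafu.pig.stats.StreamingMedian();

-- load one row per reading, one column per frequency
x = LOAD 'brainwaves.csv' USING PigStorage(',')
    AS (ALPHA:double, BETA_H:double, BETA_L:double, GAMMA:double, THETA:double);

-- gather the five channel values into a bag and take their median
y = FOREACH x GENERATE Median(TOBAG(ALPHA, BETA_H, BETA_L, GAMMA, THETA));
DUMP y;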

You can also run and schedule these scripts from an Azure Synapse Analytics pipeline, as shown in Figure 6.22.

FIGURE 6.22 Azure HDInsight batch processing

The HDInsight tab is where you select the HDInsight linked service that connects to your Azure HDInsight cluster. The attributes on the Script tab might look familiar, because the Script linked service points to the Azure storage account where the script file is hosted. This is the same approach you took when executing a batch process using Azure Batch and an Azure Databricks Apache Spark cluster. The file you target via the File Path attribute would be a Hive (.hql) or Pig script containing code like the preceding examples. Performing batch processing on Azure HDInsight is a valid approach if you already have existing procedures on-premises and you want to move to the Azure platform. Keep in mind, however, that the administration and management requirements for running HDInsight batch processing are greater than those of other Azure products that provide the same functionality.
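Behind the designer shown in Figure 6.22, the activity is stored as JSON in the pipeline definition. The following is a minimal sketch of an HDInsight Hive activity; the activity name, both linked service reference names, and the script path are hypothetical placeholders:

{
  "name": "BrainwaveBatch",
  "type": "HDInsightHive",
  "linkedServiceName": {
    "referenceName": "YourHDInsightLinkedService",
    "type": "LinkedServiceReference"
  },
  "typeProperties": {
    "scriptPath": "scripts/brainwaves.hql",
    "scriptLinkedService": {
      "referenceName": "YourStorageLinkedService",
      "type": "LinkedServiceReference"
    }
  }
}

A Pig script is configured the same way, with the activity type set to HDInsightPig.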

Azure Data Lake Analytics

As mentioned in Chapter 1, Azure Data Lake Analytics is currently supported only by Azure Data Lake Storage Gen1. ADLS Gen1 is scheduled to retire on February 29, 2024, and there is no plan to support ADLS Gen2 in Azure Data Lake Analytics; therefore, Azure Data Lake Analytics will be retired on the same day. You can read the official announcement at https://github.com/azure-deprecation/dashboard/issues/209.

Create Data Pipelines

Your first exposure to an Azure Synapse Analytics pipeline in this book was in Chapter 2, “CREATE DATABASE dbName; GO,” Figure 2.31, which may seem a long time ago. If you have persevered, read all the content, and completed all the exercises since then, you are in a good position concerning data pipelines. As a summary, consider that after that first introduction, numerous other places exposed you to pipelines. For example, Chapter 3 provided detailed coverage of the pipeline feature when accessed via the Integrate hub, and you created a pipeline for the first time in Exercise 3.13. Figure 3.51 contains a basic illustration of a pipeline, which consists of datasets, linked services, triggers, and activities. You also created a pipeline in Chapter 4 (Exercise 4.13), and in Chapter 5 you performed pipeline work in Exercises 5.1, 5.2, 5.3, 5.6, and 5.13. You created a pipeline in this chapter as well, and a rather complicated one at that. It goes without saying that you know what a pipeline is, but a few more aspects are worthy of discussion.
