Transform Data by Using Apache Spark – Transform, Manage, and Prepare Data

  1. Log in to the Azure portal at https://portal.azure.com ➢ navigate to the Azure Databricks workspace you created in Exercise 3.14 ➢ click the Launch Workspace button in the middle of the Overview blade ➢ select the Compute menu item ➢ select the cluster you also created in Exercise 3.14 ➢ click the Start button ➢ select the + Create menu item ➢ select Notebook ➢ enter a name (I used brainjammer‐eda) ➢ select Python from the Default Language drop‐down list box ➢ select the cluster you just started from the drop‐down list box ➢ and then click Create.
  2.  Enter the following syntax into the first cell, and then run the code:

import pandas as pd
df = spark.read.option("header", "true").parquet(
     "wasbs://<container>@<endpoint>/transformedBrainwavesV1.parquet")
pdf = df.select(df.SCENARIO, df.ELECTRODE, df.FREQUENCY,
                df.VALUE.cast('float')).toPandas()

  3. Hover your mouse over the lower middle of the previous cell ➢ click the + to add a new cell ➢ enter the following syntax ➢ and then run the code in the cell.
  4. Add another cell ➢ enter the following syntax ➢ and then run the code.

  5. Add another cell ➢ enter the following syntax ➢ and then run the code.

  6. Add another cell ➢ enter the following syntax ➢ run the code ➢ select the chart button group expander below the cell results ➢ select Box Plot ➢ select the Plot Options button ➢ configure the chart as shown in Figure 5.42 ➢ and then click Apply.

FIGURE 5.42 Azure Databricks—configuring a box plot chart

When you expand out the box plot, you should see the chart illustrated in Figure 5.43.

FIGURE 5.43 Azure Databricks—configuring a box plot chart (2)

Exercise 5.14 uses a package named Pandas, which is one of the most popular libraries for working with data structures. You can find complete information about this package at https://pandas.pydata.org. The first cell imports the package, which is preinstalled on an Azure Databricks node by default; no action is required on your part to use it other than importing it. The next line of code loads the transformed brainjammer brain waves into an Apache Spark DataFrame. The data is then projected to keep only the necessary columns, which are selected and converted into a Pandas DataFrame using the toPandas() method:

pdf = df.select(df.SCENARIO, df.ELECTRODE,
                df.FREQUENCY, df.VALUE.cast('float')).toPandas()

The Pandas package contains, among other things, two methods: head() and tail(). These methods return the first five and last five observations from the dataset, respectively.
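As a quick illustration of this behavior, the following sketch runs against a small synthetic DataFrame (hypothetical values, not the brainjammer data). Both methods default to five rows and also accept an explicit row count:

```python
import pandas as pd

# Small synthetic stand-in for the brainjammer dataset (hypothetical values)
pdf = pd.DataFrame({
    "SCENARIO": ["ClassicalMusic"] * 10,
    "ELECTRODE": ["AF3"] * 10,
    "FREQUENCY": ["ALPHA"] * 10,
    "VALUE": [float(v) for v in range(10)],
})

print(pdf.head())   # first five observations
print(pdf.tail())   # last five observations
print(pdf.head(3))  # an explicit count overrides the default of five
```

The same calls work identically on the full DataFrame built in the exercise; only the number of rows behind them differs.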

The shape Pandas property describes the number of rows and columns in the dataset. In this context the rows are sometimes referred to as observations, and columns as characteristics. Therefore, the dataset consists of 4,437,221 observations, each of which has four characteristics.

pdf.shape
(4437221, 4)

The info() method returns information about the data types (dtypes) attributed to the characteristics. When the df.select() method was performed to populate the dataset, a cast() was applied to the VALUE characteristic; therefore, the result set shows a Dtype of float32 for the VALUE column.

pdf.info()

The describe() method is useful for retrieving a summary of various statistical outputs. The mean, standard deviation, percentiles, and minimum and maximum values are all calculated and rendered with executions of a single method.

pdf.describe()
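To make the shape of that summary concrete, here is a minimal sketch on a tiny hypothetical DataFrame (not the brainjammer data). The index of the returned DataFrame names each statistic, and each numeric column gets its own set of values:

```python
import pandas as pd

# Hypothetical sample values for illustration only
pdf = pd.DataFrame({"VALUE": [1.0, 2.0, 3.0, 4.0, 5.0]})

summary = pdf.describe()
print(summary)
# The summary's index lists the statistics that were computed
print(list(summary.index))
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
```

Because describe() returns a regular DataFrame, individual statistics can be read back with .loc, for example summary.loc["mean", "VALUE"].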

Finally, passing a DataFrame as a parameter to the display() method enables charting features. Selecting the charting button below the cell provides some basic capabilities for visualizing your data. Many third‐party open‐source libraries are available for data visualization, for example, Seaborn, Bokeh, Matplotlib, and Plotly. And Azure Databricks offers many options for charts and exploratory data analysis. If you have some ideas or take this any further, please leave a message on GitHub. The notebook used in the code samples in Exercise 5.14 has been exported as a Jupyter notebook and placed on GitHub.
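Outside Databricks, a comparable box plot can be produced with Matplotlib. The following is a minimal sketch on made-up data; the column names mirror the exercise, but the values and output filename are assumptions for illustration:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display attached
import matplotlib.pyplot as plt

# Hypothetical sample values grouped by FREQUENCY
pdf = pd.DataFrame({
    "FREQUENCY": ["ALPHA"] * 5 + ["BETA_L"] * 5,
    "VALUE": [2.1, 2.5, 1.9, 3.0, 2.2, 0.9, 1.1, 1.0, 1.3, 0.8],
})

# One box per FREQUENCY value, with VALUE on the y-axis
pdf.boxplot(column="VALUE", by="FREQUENCY")
plt.suptitle("")  # suppress the automatic grouping title
plt.title("Brain wave VALUE by FREQUENCY")
plt.savefig("boxplot.png")
```

This mirrors the keys/values grouping configured in the Plot Options dialog in Figure 5.42, with FREQUENCY as the grouping key and VALUE as the plotted measure.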
