Data management can be described in many ways, but at its core it is a set of disciplines for supervising your enterprise data landscape. The topics that a data management scenario covers vary from one industry to another. Figure 5.41 illustrates some of the more common disciplines.
FIGURE 5.41 Data management disciplines
Governance is concerned with the privacy, access control, and retention of your data. Chapter 8 covers this in much more detail and explains which tools the Azure platform provides for implementing a governance and compliance solution. Security has to do not only with controlling access, such as reads, writes, and deletes, but also with protecting the physical data at its stored location. This is also covered in more detail in Chapter 8.
The management and usefulness of metadata were covered in Chapter 4, but in summary, metadata is information that exists to explain what the data is. It is critical for discovering data and identifying its purpose.
Managing master data has to do with storing data in its originally ingested form. Data is used for many reasons, and over time random updates or deletes can corrupt the data source, rendering it useless. Having a single version of the truth that is protected and managed is critical to the success of data management, which leads to the aspect of data quality. It does not take many actions to reduce the value of a dataset or database. Maintaining quality should include regularly checking access and permissions to help keep the data in a useful state.
A Big Data solution also needs to be managed and is a common part of a data management solution. In many cases the processing of large datasets takes place in the cloud, which means you need to take special care when transferring the data, particularly if it is of a sensitive nature. As your data progresses from the single version of the truth, also known as the master version, through a Big Data solution, it needs a place to be stored, typically a data lake or a data warehouse. The data warehouse location needs policies and protective procedures, just like any other component of the data management solution.
Finally, the architecture discipline can cover a hybrid scenario in which some data is stored on‐premises and processing and temporary storage are performed in the cloud. It is important to know what data is stored where and how it flows through the system architecture.
Azure Databricks
Azure Databricks offers some nice charting capabilities, along with a few helpful basic EDA commands. Perform Exercise 5.14 to practice some of those commands and generate a chart or two.
When performing Exercise 5.14, if you receive the error “Failure to initialize configuration. Invalid configuration value detected for fs.azure.account.key,” it means you are trying to access an ADLS container instead of the blob container. Review step 2 of Exercise 5.4 to remind yourself why this happens.
Before you begin Exercise 5.14, place the two Parquet files you created in Exercise 5.13 in an Azure Storage blob container. The two files, NormalizedBrainwavesSE.parquet and transformedBrainwavesV1.parquet, are available on GitHub in the Chapter05/Ch05Ex14 directory.
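As a rough illustration of what loading those files from a blob container can look like in a notebook, the following Python sketch builds a wasbs:// path, registers the storage account key on the Spark session, and runs a few basic EDA calls. It is not the code from Exercise 5.14. The helper names (account_key_setting, wasbs_path, read_parquet_from_blob, quick_look) are hypothetical, and the account name, container name, and key are placeholders you would supply yourself.

```python
"""Sketch: reading the Exercise 5.14 Parquet files from a blob container.

All helper names here are illustrative assumptions, not part of the exercise.
"""

BLOB_ENDPOINT = "blob.core.windows.net"  # Blob endpoint, used with wasbs://


def account_key_setting(account: str) -> str:
    """Name of the Spark setting that holds the key for the blob endpoint."""
    return f"fs.azure.account.key.{account}.{BLOB_ENDPOINT}"


def wasbs_path(container: str, account: str, file_name: str) -> str:
    """Build a wasbs:// URI that targets a blob container."""
    return f"wasbs://{container}@{account}.{BLOB_ENDPOINT}/{file_name}"


def read_parquet_from_blob(spark, container, account, account_key, file_name):
    """Register the account key on the session, then load one Parquet file.

    `spark` is the SparkSession that a Databricks notebook provides.
    """
    spark.conf.set(account_key_setting(account), account_key)
    return spark.read.parquet(wasbs_path(container, account, file_name))


def quick_look(df):
    """Run a few basic EDA calls on a Spark DataFrame."""
    df.printSchema()      # column names and types
    print(df.count())     # number of rows
    df.describe().show()  # count, mean, stddev, min, and max per column
```

In a notebook you might call read_parquet_from_blob(spark, "<container>", "<account>", "<key>", "NormalizedBrainwavesSE.parquet"), pass the result to quick_look, and then use display(df) to reach the charting options. Keeping the path construction in its own function makes it easy to confirm that the URI points at the blob endpoint and not at an ADLS endpoint.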