Predict Data Using Azure Machine Learning – Transform, Manage, and Prepare Data-2

Be sure to download all the source code in the Chapter05/Ch05Ex15 directory from GitHub at https://github.com/benperk/ADE. Only a portion of the code is provided in Exercise 5.15, because it repeats for every brain wave scenario. The first line of code uses the filter() method to narrow the data down to a specific scenario, electrode, and frequency. The next line of code gives the resulting column a name that identifies the scenario, electrode, and frequency. The column is cast to a float data type and limited to 20,246 rows of data. The limit ensures that the number of rows is the same for all scenarios. Finally, an ID column is added to the dataset using the value generated by the row_number() function.
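The per-scenario steps described above can be sketched in plain Python. This is only an illustration of the logic, not the PySpark code from Exercise 5.15: the sample rows, the input key names (SCENARIO, ELECTRODE, FREQUENCY, VALUE), the MED prefix argument, and the helper prepare_column() are all assumptions made for the example, and the row limit is shrunk from 20,246 to 2 so the result is easy to read.

```python
from itertools import islice

# Hypothetical sample rows; the real exercise reads the brainjammer readings from a table.
readings = [
    {"SCENARIO": "Meditation", "ELECTRODE": "AF3", "FREQUENCY": "ALPHA", "VALUE": "4.1"},
    {"SCENARIO": "Meditation", "ELECTRODE": "AF3", "FREQUENCY": "ALPHA", "VALUE": "3.8"},
    {"SCENARIO": "Meditation", "ELECTRODE": "AF4", "FREQUENCY": "ALPHA", "VALUE": "9.9"},
    {"SCENARIO": "Meditation", "ELECTRODE": "AF3", "FREQUENCY": "ALPHA", "VALUE": "4.4"},
    {"SCENARIO": "PlayingGuitar", "ELECTRODE": "AF3", "FREQUENCY": "ALPHA", "VALUE": "7.2"},
]


def prepare_column(rows, scenario, electrode, frequency, prefix, max_rows):
    """Build one named, numbered column for a scenario/electrode/frequency."""
    # 1. Keep only the rows for the requested scenario, electrode, and frequency.
    matching = (
        r for r in rows
        if r["SCENARIO"] == scenario
        and r["ELECTRODE"] == electrode
        and r["FREQUENCY"] == frequency
    )
    # 2. Build a column name that identifies the combination.
    name = f"{prefix}{electrode}{frequency}".upper()
    # 3. Cast each value to float and stop after max_rows rows.
    values = [float(r["VALUE"]) for r in islice(matching, max_rows)]
    # 4. Number the rows from 1, mimicking a row_number() style ID.
    return [{"ID": i, name: v} for i, v in enumerate(values, start=1)]


med_af3_alpha = prepare_column(
    readings, "Meditation", "AF3", "ALPHA", prefix="MED", max_rows=2
)
print(med_af3_alpha)
```

Because every column is cut to the same number of rows and numbered the same way, the ID values line up across scenarios, which is what makes a later join on ID possible.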

The new column name matters in the next step, which joins all the columns into a single DataFrame. The join uses the ID column generated by the row_number() function to match rows across the DataFrames. Because the number of rows is the same for all DataFrames, every row has a match.

The dffull DataFrame yields the desired, completely transformed dataset. The columns containing the brainjammer brain wave readings per scenario, electrode, and frequency are selected into a DataFrame, leaving out the ID column because that data is no longer needed. The contents of that DataFrame are then loaded into the table shown in Figure 5.45.
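The join on ID and the final selection can be sketched the same way. This is a minimal plain-Python illustration rather than the exercise's PySpark: the helper join_on_id(), the GTRAF3ALPHA column name, and the sample values are assumptions, and an inner join is assumed because the text only says that every row has a match.

```python
def join_on_id(*frames):
    """Inner-join lists of row dicts on their shared ID key."""
    merged = {row["ID"]: dict(row) for row in frames[0]}
    for frame in frames[1:]:
        by_id = {row["ID"]: row for row in frame}
        # Keep only the IDs present on both sides, combining their columns.
        merged = {i: {**merged[i], **by_id[i]} for i in merged if i in by_id}
    return [merged[i] for i in sorted(merged)]


# Two hypothetical per-scenario columns with the same number of rows.
med = [{"ID": 1, "MEDAF3ALPHA": 4.1}, {"ID": 2, "MEDAF3ALPHA": 3.8}]
gtr = [{"ID": 1, "GTRAF3ALPHA": 7.2}, {"ID": 2, "GTRAF3ALPHA": 6.9}]

dffull = join_on_id(med, gtr)

# Select every column except ID for the final dataset.
dataset = [{k: v for k, v in row.items() if k != "ID"} for row in dffull]
print(dataset)
```

Distinct column names are what keep the joined columns from overwriting one another, which is why each column was renamed before the join.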

The data that is submitted to an AutoML job for modeling must be in a format that supports the type of modeling you want to perform. The mapping of those data format requirements to the modeling algorithm is outside the scope of this book. However, note that all the effort to transform the data in Exercise 5.15 was required in order to train a linear regression model with AML. Transforming data to meet such requirements without losing the intent and the value of the data, as you experienced, takes hands-on experience as much as it takes technical skill.

As you may have noticed in step 7, in addition to regression, there were two other model types: classification and time series forecasting. The three model types are summarized here:
• Classification: Predicts the likelihood that a specific outcome will be achieved (binary classification) or detects the category an attribute belongs to (multiclass classification). Example: foresee if a customer will renew or cancel their subscription.
• Regression: Approximates a numeric value based on input variables. Example: predict stock prices based on the weather.
• Time series forecasting: Assesses values and trends based on historical data. Example: predict interest rate developments over the next year.

After selecting the regression model and progressing to the next window, you were prompted to enter the AML workspace and the linked service that pointed to it. The column you selected as the target column was MEDAF3ALPHA. The expectation was that the results of the AML modeling would provide some information about how the meditation brain wave values related to the other values in the table. Finally, you selected the Apache Spark pool running version 2.4, which provides the compute for running the modeling. On the next window you left the default value for the Primary metric, which was the Spearman correlation. The Spearman rank correlation is useful for assessing how strongly two variables are related. It measures statistical dependence by comparing the rank order of the values rather than the values themselves, so it captures any consistently increasing or decreasing relationship, not only a linear one. The other options are Normalized Root Mean Squared Error, R2 Score, and Normalized Mean Absolute Error. These are statistical concepts in their own right; if they interest you, you can pursue them further using other resources.
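To make the Spearman correlation less abstract, here is a minimal sketch in plain Python. It ranks both series (tied values share the average of their positions) and then computes the ordinary Pearson correlation of the ranks. The function names and sample numbers are assumptions made for illustration, and this is not the implementation that AML uses internally.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical values: the second series always rises with the first, but not linearly.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.2, 1.9, 10.0, 11.0, 50.0]

print(spearman(actual, predicted))  # close to 1.0, because the ordering matches exactly
print(pearson(actual, predicted))   # lower, because the relationship is not a straight line
```

The comparison of the two printed values shows why a rank-based metric is attractive: it rewards a model for getting the ordering right even when the sizes of the predicted values are off.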

The next section is rather important, because it lets you control the amount of time the modeling is allowed to run. This process can consume a lot of resources and a lot of time, and because you are charged for both, it is wise to control them. One reason for reducing the data to a specific scenario, electrode, and frequency was to reduce the time required to complete the modeling. After the regression modeling has completed, many results should be available for review on the Model tab. Many combinations of algorithms and parameters are tried, potentially thousands, and the result of each is rendered.

Ileana Pecos
