Barry Smart


In part 2 of this course Barry Smart, Director of Data and AI, walks through a demo showing how you can use Microsoft Fabric to set up a "data contract" that establishes minimum data quality standards for data that is being processed by a data pipeline.

He deliberately passes bad data into the pipeline to show how the process can be set up to "fail elegantly" by dropping the bad rows and continuing with only the good rows. Finally, he uses the new Teams pipeline activity in Fabric to show how you can send a message to the data stewards who are responsible for the data set, informing them that validation has failed, itemising the specific rows that failed and the validation errors that were generated in the body of the message.

The demo uses the popular Titanic data set to show features of the data engineering experience in Fabric, including Notebooks, Pipelines and the Lakehouse. It uses the popular Great Expectations Python package to establish the data contract and Microsoft's mssparkutils Python package to enable the exit value of the Notebook to be passed back to the Pipeline that has triggered it.

Barry begins the video by explaining the architecture that is being adopted in the demo, including Medallion Architecture and DataOps practices. He explains how these patterns have been applied to create a data product that provides Diagnostic Analytics of the Titanic data set. This forms part of an end-to-end demo of Microsoft Fabric that we will be providing as a series of videos over the coming weeks.

Chapters:

  • 00:12 Overview of the architecture
  • 00:36 The focus for this video is processing data to Silver
  • 00:55 The DataOps principles of data validation and alerting will be applied
  • 02:19 Tour of the artefacts in the Microsoft Fabric workspace
  • 02:56 Open the "Validate Location" notebook and view its contents
  • 03:30 Inspect the reference data that is going to be validated by the notebook
  • 05:14 Overview of the key stages in the notebook
  • 05:39 Set up the notebook, using %run to establish utility functions
  • 06:21 Set up a "data contract" using great expectations package
  • 07:45 Load the data from the Bronze area of the lake
  • 08:18 Validate the data by applying the "data contract" to it
  • 08:36 Remove any bad records to create a clean data set
  • 09:04 Write the clean data to the lakehouse in Delta format
  • 09:52 Exit the notebook using mssparkutils to pass back validation results
  • 10:53 Lineage is used to discover the pipeline that triggers it
  • 11:01 Exploring the "Process to Silver" pipeline
  • 11:35 An "If Condition" is configured to process the notebook exit value
  • 11:56 A Teams pipeline activity is set up to notify users
  • 12:51 Title and body of Teams message are populated with dynamic information
  • 13:08 Word of caution about exposing sensitive information
  • 13:28 What's in the next episode?

From Descriptive to Predictive Analytics with Microsoft Fabric:

Microsoft Fabric End to End Demo Series:

Microsoft Fabric First Impressions:

Decision Maker's Guide to Microsoft Fabric

and find all the rest of our content here.

Transcript

Hello fellow Fabricators! In this video we're going to explore how we can use Microsoft Fabric to validate files early in the lifecycle. We will be working on the Titanic Diagnostic Analytics Data Product. This adopts a Medallion Architecture to ingest, process, and project the data by promoting it through the Bronze, Silver, and Gold areas of the lake.

The purpose of this Data Product is to create a Power BI report that allows users to interactively explore the Titanic data. In this video, we're going to focus specifically on processing data that has already been ingested to the Bronze area of the lake. So we're aiming to load, validate, clean and standardize the data before writing it to the Silver area of the lake.

In this series, we want to show how Fabric can deliver important DataOps principles. So we're going to take the opportunity to show how we can validate the data and then alert relevant people within the organization if data validation issues are encountered. We all know the saying, rubbish in, rubbish out. So the principle here is to enforce a data contract that sets clear expectations around the quality of the data we will accept into our pipeline. In this case, we're going to use a Fabric Notebook, importing a popular Python package called Great Expectations. And we're going to use that to define and apply a set of data quality rules.

We will then deliberately introduce some bad data to show how we can detect and then drop the bad rows before writing the good rows to the Silver area of the lake. The process is orchestrated using a Fabric pipeline, and we'll use the new Teams activity in Fabric pipelines to show how you can alert the relevant people in the organization using Teams and provide them with detailed information about the validation issues.

So let's get into the demo. Here we are in Microsoft Fabric. We've opened the Titanic Diagnostic Analytics workspace. A workspace is where you can gather all of the artifacts related to a specific Data Product. Looking in this particular workspace, you can see we've got some pipelines, which we use to orchestrate the end-to-end process. We've got some Notebooks, which contain our core data engineering logic. We've got the Bronze, Silver, and Gold lakehouses that we talked about, and then finally the Power BI semantic model and report that we're presenting to end users. Now, in this demo we're specifically interested in the Validate Location Notebook, so let's look at that in a bit more detail.

One feature in Fabric that we find really useful is the interactive table of contents. It's driven by any headings that you have in markdown cells, so heading level 1, level 2 and so on drive the hierarchy that you can see here on the right hand side. At a very high level, it describes the stages that we're going to apply in the Notebook.

So the purpose of the Notebook is to validate the locations data. It's this very simple spreadsheet that we've already landed in the Bronze area of the lake. Now, this is quite typical in data projects: you've got your core operational data, which in this case is our passenger data from the Titanic.

It's got thousands of rows of data in it, and one of the columns records the location where the passenger boarded the Titanic as a single uppercase character. That's not very helpful when you're presenting this information in a report, so the purpose of this spreadsheet is to decode that single letter code to make it more useful.

And so you can see here we're recording a more helpful name for each location. The ship started at Southampton in England, then stopped in France at Cherbourg to pick up more passengers, and then finally at Queenstown in Ireland before crossing the Atlantic. We're also using this reference spreadsheet to capture the latitude and longitude for each of those locations so that we can plot them on a map in Power BI, and you'll see that later on.

Finally, you can see that we have also added an extra row of data here that's invalid: the longitude is well out of range for a longitude, and the embark code is three letters when it should be just a single uppercase character. So this row represents a bad row of data, and we want this Notebook to load the data, apply our data contract to the spreadsheet, and reject this bad row.

So let's just see that in action. We can see the process here on the right hand side. We're going to create some expectations, load the data into a Pandas DataFrame, and apply our expectations to that data. If we detect any bad records, we're going to remove them. Then we're going to write the remaining clean records to the lake. And finally, we're going to exit the Notebook, providing detailed context of any validation issues that were encountered. So let's have a look at that in a bit more detail. In terms of setting up, we're running a couple of utility Notebooks that contain code that we're using across other areas of our solution.

The first is a helper that sets up the Great Expectations suite and then deals with the output from the validations that we run using our expectations. The second Notebook helps us to set up our logging. Here we're setting up a connection to Log Analytics in Azure, which allows us to centrally capture all of our logging to support the observability DataOps principle that we talked about earlier.
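As a rough illustration (not the actual utility code from the video), the setup cells might look something like this; the notebook names and the get_logger helper are hypothetical:

```python
# Hypothetical setup cells for the Validate Location notebook.
# In a Fabric notebook, %run executes another notebook from the same
# workspace and makes its functions available in this session.
%run utils_great_expectations
%run utils_logging

# The logging utility notebook is assumed to expose a helper like this
# (hypothetical name) that returns a logger wired up to Azure Log Analytics.
logger = get_logger("validate_location")
logger.info("Notebook setup complete")
```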

So now we're into the core function of our Notebook: validating locations. We set up an expectation suite, we give it a name, and now we're able to create individual expectations. At this stage we're not running the expectations, we're simply configuring them as a set of rules that build up to create our data contract. The first rule checks that all four columns we're expecting are present in the spreadsheet. The next rule checks that there are at least three rows in the data. We're applying some domain knowledge here: we know that the Titanic visited three locations, so we should expect at least three rows in the data.

We then check that the longitude and the latitude are in acceptable ranges; there are clear limits to those based on the world being a globe. And finally, you can see here that we check that the Embark location code is a single uppercase character. We do that using a regular expression.

Throughout all of these expectations, you can see that we've used the metadata field to capture additional useful information that we can play back to end users should any of these expectations fail, and you'll see that in action later on. So now we've got our contract set up, we can start to apply it.
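To make that concrete, here is a rough sketch of what such a "data contract" might look like. The column names and messages are illustrative, and it assumes the classic (pre-1.0) Great Expectations API, which may differ from the version used in the video:

```python
# A sketch of a data contract built as a Great Expectations suite.
# Uses the classic (pre-1.0) great_expectations API; names are illustrative.
from great_expectations.core.expectation_suite import ExpectationSuite
from great_expectations.core.expectation_configuration import ExpectationConfiguration

suite = ExpectationSuite(expectation_suite_name="locations_data_contract")

# All four expected columns must be present.
suite.add_expectation(ExpectationConfiguration(
    expectation_type="expect_table_columns_to_match_set",
    kwargs={"column_set": ["EmbarkCode", "Location", "Latitude", "Longitude"]},
    meta={"notification": "The locations spreadsheet is missing expected columns."},
))

# Domain knowledge: the Titanic called at three ports, so expect at least three rows.
suite.add_expectation(ExpectationConfiguration(
    expectation_type="expect_table_row_count_to_be_between",
    kwargs={"min_value": 3},
    meta={"notification": "Expected at least three embarkation locations."},
))

# Latitude and longitude must be physically possible values.
suite.add_expectation(ExpectationConfiguration(
    expectation_type="expect_column_values_to_be_between",
    kwargs={"column": "Latitude", "min_value": -90, "max_value": 90},
    meta={"notification": "Latitude must be between -90 and 90 degrees."},
))
suite.add_expectation(ExpectationConfiguration(
    expectation_type="expect_column_values_to_be_between",
    kwargs={"column": "Longitude", "min_value": -180, "max_value": 180},
    meta={"notification": "Longitude must be between -180 and 180 degrees."},
))

# The embark code must be a single uppercase character.
suite.add_expectation(ExpectationConfiguration(
    expectation_type="expect_column_values_to_match_regex",
    kwargs={"column": "EmbarkCode", "regex": r"^[A-Z]$"},
    meta={"notification": "Embark code must be a single uppercase letter."},
))
```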

So we load the data from the Bronze area of the lake. The spreadsheet that I showed you has already been ingested and landed on the lake, so we can use Pandas to read the data from that Excel sheet. We can log any exceptions along the way, because we've got our logger set up, and we can display the data.
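A minimal sketch of that load step might look like the following. The file path and logger name are illustrative, and the /lakehouse/default/Files location assumes the notebook has a default lakehouse attached:

```python
# A sketch of loading the locations spreadsheet from the Bronze area.
import logging
import pandas as pd

logger = logging.getLogger("validate_location")

bronze_path = "/lakehouse/default/Files/locations.xlsx"  # hypothetical path

try:
    locations_df = pd.read_excel(bronze_path)  # requires openpyxl for .xlsx files
    logger.info("Loaded %d rows from %s", len(locations_df), bronze_path)
except Exception:
    logger.exception("Failed to load locations reference data from Bronze")
    raise

locations_df.head()
```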

So you can see here, it's faithfully loaded all four rows of data from that spreadsheet. We can then apply our data contract, our expectations, to that DataFrame. We pass the DataFrame to Great Expectations and ask it to validate that data, and you can see it's applied the validation process here.

One of the outputs from that process is a list of row indexes in the DataFrame that have failed validation. We can use that to remove those invalid rows and therefore generate a clean set of data. So you can see here that we've got three clean rows of data; we've removed that bad row from the location information.
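The "validate, then drop the failing rows" pattern might look roughly like the sketch below. To keep it self-contained it applies a couple of expectations directly to a toy DataFrame rather than the full suite, again assuming the classic (pre-1.0) Great Expectations API:

```python
# A sketch of validating a DataFrame and dropping the rows that fail.
import great_expectations as ge
import pandas as pd

df = pd.DataFrame({
    "EmbarkCode": ["S", "C", "Q", "XXX"],          # "XXX" is invalid
    "Location":   ["Southampton", "Cherbourg", "Queenstown", "Bad Row"],
    "Latitude":   [50.9, 49.6, 51.8, 48.0],
    "Longitude":  [-1.4, -1.6, -8.3, -999.0],      # -999 is out of range
})

gdf = ge.from_pandas(df)

# result_format="COMPLETE" includes the indexes of any failing rows.
results = [
    gdf.expect_column_values_to_match_regex(
        "EmbarkCode", r"^[A-Z]$", result_format="COMPLETE"),
    gdf.expect_column_values_to_be_between(
        "Longitude", min_value=-180, max_value=180, result_format="COMPLETE"),
]

# Collect the union of row indexes that failed any expectation, then drop them.
bad_indexes = set()
for r in results:
    bad_indexes.update(r.result.get("unexpected_index_list", []))

clean_df = df.drop(index=sorted(bad_indexes))
print(f"Dropped {len(bad_indexes)} of {len(df)} rows; {len(clean_df)} clean rows remain")
```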

So we're now in a position to write that data to the lakehouse. We do that using the Spark engine in Fabric: we create a Spark DataFrame from the Pandas DataFrame, and we use the Delta format to write to a lakehouse table. So we're writing the data in a structured way, using a standard format, which is Delta, into the lakehouse so that we can use it downstream. And you can see here we're using the overwrite mode, so we're not appending the data; we've got a new reference data set and we're overwriting anything that was in that table with it. You can see here we're logging that we've written those three rows to the lakehouse.
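Writing the cleaned data out in Delta format might look like this sketch. The table name is illustrative, and in a Fabric notebook the spark session is already available (the getOrCreate call is just to keep the snippet self-contained):

```python
# A sketch of writing the clean reference data to the lakehouse as a Delta table.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already provided in a Fabric notebook

clean_df = pd.DataFrame({                   # stand-in for the validated data
    "EmbarkCode": ["S", "C", "Q"],
    "Location":   ["Southampton", "Cherbourg", "Queenstown"],
})

spark_df = spark.createDataFrame(clean_df)

(spark_df.write
    .format("delta")
    .mode("overwrite")           # replace the whole table with the new reference data
    .saveAsTable("locations"))   # hypothetical table name in the lakehouse
```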

And then the final stage in the Notebook is to pass back, from the Notebook to any pipeline that has called it, a JSON object that describes the results of that validation process. So you can see overall here we've recorded a success status of false, because it hasn't gone 100 percent successfully.

We've summarized that one of four rows was dropped due to validation issues. And then finally, in the details element of this JSON object, we're passing back an HTML table which itemizes all the validation issues. So the next stage is for us to go over to the pipeline that calls this Notebook to see how we can deal with this exit information that we're passing back.
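The exit step might look something like the sketch below. The field names in the payload are illustrative (the video's exact JSON shape isn't reproduced here), but mssparkutils.notebook.exit is the mechanism being described:

```python
# A sketch of returning the validation outcome to the calling pipeline.
# mssparkutils.notebook.exit() ends the notebook run and surfaces the string
# it is given as the exit value of the pipeline's Notebook activity.
import json
from notebookutils import mssparkutils  # built in to Fabric notebooks

exit_payload = {
    "success": False,                                         # overall pass/fail
    "title": "Validation failed for locations reference data",
    "summary": "1 of 4 rows dropped due to validation issues",
    "details": "<table><tr><th>Row</th><th>Issue</th></tr>"
               "<tr><td>3</td><td>Longitude out of range</td></tr></table>",
}

mssparkutils.notebook.exit(json.dumps(exit_payload))
```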

So here we are back in the workspace. We've been looking at the Validate Location Notebook, and if we inspect the lineage for this Notebook, we can see that it's triggered by the Process to Silver pipeline. Now, if we open up the pipeline, we can see that it runs two Notebooks: the locations Notebook that we've been looking at, and also the passengers Notebook.

And you can see both follow the same pattern. The Notebook is run, and the pipeline then catches the exit value that's returned by the Notebook, which is used to drive this If Condition. So if we drill into that in a bit more detail, we can see that the expression here picks up the success element from the exit value, which is in the JSON object.

And this is a boolean value, so it's set to true or false, and that's used to drive the condition. If it's false, we know there's been a data validation issue. So if we look at the false branch of that condition, we can see we're using the new Teams pipeline activity, which is available in Fabric, and we're using that to dynamically raise a Teams message.

So again, we're plucking elements out of that JSON object. The details element is used to form the body of the Teams message, and the title element from the JSON object is used to set the subject of the Teams message. And we know that the body is an HTML table, so the body of the message needs to be a sort of HTML document, and it renders that.
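As a rough illustration, the If Condition expression follows a pattern like `@json(activity('Validate Location').output.result.exitValue).success`, which parses the Notebook activity's exit value as JSON and reads its success flag. The activity name here is hypothetical, and the exact property path should be confirmed against the Notebook activity's output in a debug run of the pipeline.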

So let's go over and have a look at how that appears in Teams. This is how the Teams message appears. This approach creates a fast feedback loop: the team who are responsible for this reference data are immediately notified when validation issues are encountered, and they're also provided with sufficient information to address those issues.

But one word of caution: in the table that forms the body of this Teams message, we have chosen to show the data that caused the validation checks to fail. In some cases this may not be appropriate, if there is a risk of exposing confidential or personal information. So that's it for this video. In the next video, we're going to look in more depth at the processing we do before projecting the data to the Gold layer of the lake.

From a DataOps perspective, we're going to look at how we can put tests in place to prove our Notebook-based data engineering logic is working as intended. So please don't forget to hit like if you've enjoyed this video and subscribe if you want to keep following our content. Thanks for watching.