Overview
This year, I will be blogging about migrations and integrations with Power Platform and Common Data Service (CDS) using Azure Data Factory (ADF) and Power Automate.
This ADF overview blog post is a bit of a regurgitation of https://docs.microsoft.com/en-us/azure/data-factory/ with added insights related to Power Platform. I hope that after reading this you will have an overview of Azure Data Factory (ADF) and some tips to organise and name Azure Data Factory components.
Azure Data Factory is an Extract, Transform and Load (ETL) tool that is completely cloud based and hosted in Azure. It can be thought of as the cloud version of SQL Server Integration Services (SSIS).
The components of Azure Data Factory are:
1. Pipelines
2. Activities
3. Linked Services
4. Datasets
5. Dataflows
6. Integration Runtimes (IR)
Pipelines and Activities
Pipelines are used to group a collection of activities into one logical, executable container.
Activities are individual tasks that are performed on data, for example a Copy Data activity, which copies data from a source to a target, or a Data Flow activity, which transforms data. These are the two most common activities I use when migrating and integrating Common Data Service (CDS) with other applications.
Activities will generally take a dataset as an input and will produce one or many datasets as an output.
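To make that relationship concrete, below is a rough sketch of the JSON that sits behind a pipeline containing a single Copy Data activity, written out as a Python dict purely for illustration. The pipeline name, dataset names and source/sink types are placeholders and assumptions, not taken from a real solution.

```python
# Rough sketch of the JSON definition behind a pipeline with one Copy Data
# activity, rendered as a Python dict for illustration only. The pipeline,
# dataset and source/sink type names are placeholders/assumptions.
extract_contacts_pipeline = {
    "name": "PL_Extract_Contacts",
    "properties": {
        "activities": [
            {
                "name": "Copy Contacts To Landing",
                "type": "Copy",  # the Copy Data activity
                "inputs": [
                    {"referenceName": "DS_CDS_Contacts_Source", "type": "DatasetReference"}
                ],
                "outputs": [
                    {"referenceName": "DS_BLOB_Contacts_Landing", "type": "DatasetReference"}
                ],
                "typeProperties": {
                    "source": {"type": "CommonDataServiceForAppsSource"},
                    "sink": {"type": "ParquetSink"},
                },
            }
        ]
    },
}
```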
An Azure Data Factory instance might have many pipelines. Examples of pipelines might be to:
Copy data from Common Data Service (CDS) to SQL Server.
Archive SQL data to blob storage before the next pipeline overwrites the data source with the most current data.
Generally, I will have the below pipelines in a Power Platform migration solution. The pipelines represent the ETL process (Extract, Transform and Load), as well as the additional, common ETL functionality of updating references and archiving.
Extract Pipeline gets data from the source and saves it into blob storage. No transformation is done on the data; it is saved to blob storage in the same "shape" as in the source data store. Visually, you can think of copying and pasting data from a source data store to parquet files in blob storage. This is also referred to as landing.
Transform Pipeline gets the "extracted"/"landed" data from blob storage, transforms it, using data flows, into the same shape as the target system, and saves it into Blob Storage, Table Storage or a SQL Server database. This is referred to as staging because you are staging the data before it is loaded into the target system.
Load Pipeline gets the transformed data from staging and loads it into a model-driven app using the Common Data Service (CDS) linked service. Visually, you can think of copying and pasting the transformed data into the target model-driven app.
Update Related Data Pipeline, if required, is used to update lookups. You cannot set a lookup in Power Platform unless you have the primary key GUID of the related record, and that GUID is not created until the record is created. Therefore you need a pipeline to download the loaded records with their newly generated ids and do a second pass over the entities to set any lookup fields (a rough sketch of this second pass follows this list). This pattern has the added bonus of allowing all entity records to be created in parallel in the load pipeline, improving the performance of the ETL process.
Archive Data Pipeline, if data needs to be archived for diagnostic or compliance reasons, copies it to Blob Storage, a SQL Server database or Table Storage.
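Here is the second-pass lookup mapping mentioned above, sketched in plain Python outside of ADF just to show the logic. The entity, column and key names (legacy_account_number, parentcustomerid and so on) are hypothetical examples, not a prescribed schema.

```python
# Hypothetical sketch of the "Update Related Data" second pass.
# exported_accounts: account records pulled back out of CDS after the load
# pipeline ran, containing the newly generated GUIDs plus a legacy key.
exported_accounts = [
    {"accountid": "guid-account-1", "legacy_account_number": "A-1001"},
    {"accountid": "guid-account-2", "legacy_account_number": "A-1002"},
]

# staged_contacts: transformed contact rows that still reference their parent
# account by the legacy key, because the GUID did not exist at transform time.
staged_contacts = [
    {"contactid": "guid-contact-1", "legacy_account_number": "A-1002"},
    {"contactid": "guid-contact-2", "legacy_account_number": "A-1001"},
]

# Build a legacy key -> newly generated GUID map from the exported records.
account_id_by_legacy_key = {
    account["legacy_account_number"]: account["accountid"]
    for account in exported_accounts
}

# Produce the rows for the second pass: contact GUID plus the parent lookup GUID.
lookup_updates = [
    {
        "contactid": contact["contactid"],
        "parentcustomerid": account_id_by_legacy_key[contact["legacy_account_number"]],
    }
    for contact in staged_contacts
]

print(lookup_updates)
```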
I will create a future end-to-end ADF solution blog post demonstrating the above migration process.
Linked Services
Linked services are like connection strings to data stores that Azure Data Factory will be reading from, writing to or both.
Another way of thinking about a linked service is as a pointer to the data store that Azure Data Factory will be reading from, writing to or both.
Linked services require the following details; a sketch of a complete definition follows this list.
Name, the name of the data store.
Data store type, e.g. SQL Server, Blob Storage, Table Storage, Oracle, FTP and so on. A full list of the data store connectors available in Azure Data Factory can be found at https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-overview#supported-data-stores-and-formats
Credentials, valid, authorised credentials so that Azure Data Factory can access the data source/store.
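Putting those three details together, this is roughly the shape of a linked service definition for Azure Blob Storage, again written as a Python dict for illustration; the name and connection string values are placeholders.

```python
# Rough shape of an Azure Blob Storage linked service definition, rendered as
# a Python dict for illustration. The name and connection string are placeholders.
blob_storage_linked_service = {
    "name": "AzureBlobStorageLanding",        # Name
    "properties": {
        "type": "AzureBlobStorage",           # data store type
        "typeProperties": {
            # Credentials - in practice keep the secret in Azure Key Vault
            # rather than pasting a raw connection string into the definition.
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        },
    },
}
```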
The most common linked services I use when creating ADF pipelines are:
Common Data Service (CDS), for obvious reasons. The CDS connector can be either the target or the source depending on the direction of the migration or integration.
Blob Storage, I always import files from the source system to blob storage first. This is where the "extracted" data in the Extract, Transform and Load (ETL) process is landed, therefore I call the container "landing". Additionally, this is done first, as recommended by Microsoft, to avoid data conversion issues in Azure Data Factory (ADF).
SQL Server or Table Storage is used as the data store for staging. "Staging" is the area where data that has been transformed is saved. My definition of transformed is that the source data has been shaped to match the target system's data structure, making the load step of the ETL process a simple copy from staging to target. These data stores can also be used to store automatically generated primary keys, i.e. auto numbers or GUID ids from CDS, or an identity from SQL Server.
COST TIP – To keep the cost of my personal Azure subscription low I use Table or Blob Storage instead of SQL Server. However, when working on an enterprise migration or integration for a client I would recommend SQL Server, as it provides flexibility for manipulating data with views, functions and stored procedures.
I name linked services with the following format: <System><Technology><ETL Stage {Source, Landing, Staging, Target or Archive}><Entity>
Examples
AzureTableStorageStaging
NorthWindsTradingSQLDatabaseSource
AzureBlobStorageArchive
Datasets
Datasets are a conceptual representation of the data that comes from a linked service. Datasets are used by activities as inputs or outputs: activities act on input datasets and can store the results in one or more output datasets.
An example of a dataset is the data and data structure returned from a SQL Query or the records and attributes from an entity in Common Data Service (CDS).
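As a concrete example, a dataset definition is little more than a pointer to a linked service plus the table, entity or folder it represents. Here is a rough sketch, again as a Python dict, assuming an Azure SQL Database source; the names follow the convention described below.

```python
# Rough shape of a dataset definition: it references a linked service and names
# the table it represents. Assumes an Azure SQL Database source; names are examples.
orders_source_dataset = {
    "name": "DS_SQL_Northwind_Orders_Source",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {
            "referenceName": "NorthWindsTradingSQLDatabaseSource",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {"schema": "dbo", "table": "Orders"},
    },
}
```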
Generally, I will name Datasets with the following format \<ETL Stage {Source, Landing, Staging, Target or Archive}>\DS_<TECHNOLOGY>_<System>_<Entity>_<ETL Stage>
Examples
\Source\DS_SQL_Northwind_Orders_Source
\Transformation\DS_BLOB_DistinctAccounts_Transformation
\Transformation\DS_BLOB_DedupedContacts_Transformation
\Staging\DS_BLOB_Contacts_Staging
\Target\DS_CDS_Contacts_Target
\Target\DS_CDS_Accounts_Target
\DataUpdate\DS_TableStorage_ExportedAccountsWithIdsFromCDS_Staging
Note: transformation datasets are difficult to name because the name needs to describe the resulting transformation.
Note: a good naming convention for datasets is important because you will generally be managing a lot of datasets in an Azure Data Factory.
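Because you end up managing so many datasets, I find it easier to generate the names than to type them by hand. Below is a small, purely illustrative helper along the lines of the convention above; ADF itself neither enforces nor understands this format.

```python
# Illustrative helper that builds dataset names following the convention
# \<ETL Stage>\DS_<TECHNOLOGY>_<System>_<Entity>_<ETL Stage>.
def dataset_name(stage: str, technology: str, system: str, entity: str) -> str:
    """Return a folder-qualified dataset name for the given ETL stage."""
    return f"\\{stage}\\DS_{technology.upper()}_{system}_{entity}_{stage}"


print(dataset_name("Source", "SQL", "Northwind", "Orders"))
# \Source\DS_SQL_Northwind_Orders_Source
print(dataset_name("Staging", "BLOB", "Northwind", "Contacts"))
# \Staging\DS_BLOB_Northwind_Contacts_Staging
```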
Dataflows
I must admit, I know how to build data flows, but I do not know very well how they work under the hood (behind the scenes).
Dataflows are visually designed transformation logic. A dataflow consists of a stream of transformations.
Dataflows are designed using what Microsoft calls the graph. The graph allows you to build a transformation stream and shows the flow of data from one or more sources to one or more sinks.
The most common transformations I use in dataflows are the Aggregate, Union, Filter and Join transformations. I believe transformations are a visual representation of Apache Spark API functions, which also map to the most common SQL DML commands.
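As a rough analogy only (this is not the code ADF generates under the hood), the same four transformations expressed against Spark DataFrames in PySpark would look something like the sketch below; the table and column names are made up.

```python
# Rough PySpark analogy for the Aggregate, Union, Filter and Join transformations.
# This is not what ADF generates; it is only the conceptual equivalent.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dataflow-analogy").getOrCreate()

contacts = spark.createDataFrame(
    [("C1", "A-1001", 10), ("C2", "A-1001", 20), ("C3", "A-1002", 5)],
    ["contact_ref", "account_number", "open_cases"],
)
accounts = spark.createDataFrame(
    [("A-1001", "Northwind Traders"), ("A-1002", "Adventure Works")],
    ["account_number", "account_name"],
)

filtered = contacts.filter(F.col("open_cases") > 0)                  # Filter
joined = filtered.join(accounts, on="account_number", how="inner")   # Join
aggregated = joined.groupBy("account_name").agg(                     # Aggregate
    F.sum("open_cases").alias("total_open_cases")
)
combined = contacts.select("account_number").union(                  # Union
    accounts.select("account_number")
)

aggregated.show()
combined.distinct().show()
```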
COST TIP - Debugging data flows can be costly. Be careful when debugging Dataflows, as this spins up an Apache Spark cluster, which is by far the most expensive feature in Azure Data Factory.
Dataflows use scaled-out Apache Spark clusters. I think of this, possibly incorrectly, as ADF managing Databricks clusters for you, i.e. Azure Data Factory abstracts away the management of Databricks clusters. Please comment on whether or not this is a correct way of explaining it.
A cool Dataflows feature is that it displays the transformed data after each transformation.
Data Flow transformations can be reduced when using SQL databases, as views and stored procedures can transform most of your data. It would be interesting to do a cost analysis on which is cheaper: the cost overhead of a SQL database performing the transformations, or cheaper blob and table storage data sources combined with the expensive data flows to transform your data.
Generally, I will name Dataflows with the following format DF - <Transformation>.
Examples
DF - Get Distinct Accounts
DF - Map Contacts Lookup Values
Integration Runtimes (IR)
Integration Runtimes allow users to configure the infrastructure used by Azure Data Factory (ADF).
I personally have little experience with Integration Runtimes as I use the default Integration Runtime provided with Azure Data Factory (ADF).
If all your data stores are in Azure then the default integration runtime will be fine to use. A custom Integration Runtime needs to be configured when:
connecting to data stores on private networks,
running existing SQL Server Integration Services (SSIS) packages from Azure Data Factory (ADF),
or controlling the region in which the data processing occurs.
References
This information is based on my experience and the information provided by Microsoft at the links referenced throughout this post.