Partitioning and wildcards in an Azure Data Factory pipeline

In a previous post I created an Azure Data Factory pipeline to copy files from an on-premises system to blob storage. That was a simple copy from one folder to another.

It’s possible to add a time aspect to this pipeline. Let’s say I want to keep an archive of these files. In my source folder files get added, modified and deleted. I’d like to create daily snapshots of this folder.

To enable this scenario you only need to change the dataset that represents the file location in blob storage.

{
    "name": "AzureBlobDatasetTemplate",
    "properties": {
        "location": {
            "type": "AzureBlobLocation",
            "folderPath": "myblobcontainer/{Year}/{Month}/{Day}",
            "partitionedBy": [
                {
                    "name": "Year",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "yyyy"
                    }
                },
                {
                    "name": "Month",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "MM"
                    }
                },
                {
                    "name": "Day",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "dd"
                    }
                }
            ],
            "linkedServiceName": "StorageLinkedService"
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}

I’ve changed the folderPath and added a partitionedBy element. The folderPath now contains placeholders: the elements between curly braces. These placeholders are defined in the partitionedBy array, and each one uses the SliceStart variable to produce a value. The only other variable you can use, at this time, is SliceEnd. After making this change to the dataset, the next run of the pipeline will produce the desired output.
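To make the placeholder expansion concrete, here is a small Python sketch (an illustration only, not ADF code) that mimics how the partitionedBy formats turn a slice start time into a folder path. The function name and the strftime mappings are my own; ADF itself uses .NET format strings (yyyy, MM, dd).

```python
from datetime import datetime

# Hypothetical illustration: expand the partitionedBy placeholders in a
# folderPath the way ADF does for a given slice start time.
def expand_folder_path(template: str, slice_start: datetime) -> str:
    # Python strftime equivalents of the .NET formats in the dataset:
    # "yyyy" -> %Y, "MM" -> %m, "dd" -> %d
    partitions = {
        "Year": slice_start.strftime("%Y"),
        "Month": slice_start.strftime("%m"),
        "Day": slice_start.strftime("%d"),
    }
    for name, value in partitions.items():
        template = template.replace("{%s}" % name, value)
    return template

print(expand_folder_path("myblobcontainer/{Year}/{Month}/{Day}",
                         datetime(2015, 7, 1)))
# → myblobcontainer/2015/07/01
```

So a daily slice starting on 1 July 2015 lands its files in the 2015/07/01 subfolder of the container.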

(Screenshot: the resulting partitioned files in the blob container, organized by year/month/day)

I also tested what happens when a source file is in use: the copy fails with an exception. So make sure the application that creates the files you want to copy releases any locks.

(Screenshot: the error raised when a source file is locked)

Wildcards can be used in the file name and the filter, so you can copy all txt files in a folder with *.txt. Unfortunately, wildcards and partition placeholders cannot be combined. So if your files are all stored in one folder, but each file has the date in its name (myFile-2015-07-01.txt), you can’t create a filter with the dynamic partitions (myFile-{Year}-{Month}-{Day}.txt). The partitionedBy section can only be used in the folder structure, as shown above. If you think this would be a nice feature, go vote here!

The price of the current setup is determined by a couple of things. First, we have a low-frequency activity: an activity that runs daily or less often. The first 5 are free, so we have 25 billable activities remaining. The price of an activity depends on where it runs, on-premises or in the cloud. I’m assuming this counts as an on-premises activity, since the files are not located in Azure; I’ve asked around whether this assumption is correct but don’t have a response yet. An on-premises activity costs €0.5586, so this daily snapshot would come to almost €14 per month. If we modified everything to run hourly, we’d have to pay €554.80 per month. You can find more info on the pricing on their website.
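A quick back-of-the-envelope check of the daily-snapshot figure, using the numbers from the text (and my assumption of roughly 30 daily slices per month):

```python
# Rough cost check for the daily snapshot, assuming the on-premises
# activity rate quoted above and ~30 slices per month.
price_per_activity = 0.5586   # € per on-premises activity
free_low_frequency = 5        # first 5 low-frequency activities are free
runs_per_month = 30           # one daily slice, ~30 times a month
billable = runs_per_month - free_low_frequency
print(f"€{billable * price_per_activity:.2f} per month")
```

That works out to just under €14 a month, matching the estimate above.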

In this scenario I’ve demonstrated how to get started with Azure Data Factory. The real power however lies in the transformation steps which you can add. Instead of doing a simple copy the data can be read, combined and stored in many different forms. A topic for a future post.

Upside:

  • Rich editor
  • Fairly easy to get started
  • No custom application to write
  • On-premise support via the Data Management Gateway
  • No firewall settings need to be changed

Downside:

  • Can get quite expensive

3 Replies to “Partitioning and wildcards in an Azure Data Factory pipeline”

  1. Thanks for the information about the wildcard because I couldn’t find this on Azure documentation.

    I believe your pricing information is incorrect however. The pricing is based on an activity definition (not on how often that activity executes).

    So an activity that executes once a day is a single low frequency activity charge ($0.60). And an activity that executes once every hour is a single high frequency charge ($1).

  2. Great post about the wildcard. I have one question, though: how do we know what’s being stored in the SliceStart variable, and how can we set an initial value? For instance, we have an initial load, and all of my files will be saved in ADLS under a folder 2017-10-30; tomorrow the incremental load will save files under folder 2017-11-30. Once I create the datasets:
    * how does ADF know which value to use initially?
    * does the SliceStart variable change only when we run the pipeline, so that if I haven’t run the pipeline for a few days it will pick up where it left off?

    Thanks for the post again
