Azure Data Factory (ADF) is a tool within Microsoft's Cloud Platform that provides an environment for executing data transformations. While it integrates well with coding-based transformation tools such as Databricks, Python Notebooks, etc., it also has its own no-coding capabilities - Mapping Data Flows.
Mapping Data Flows (MDF) provides a no-coding approach to data transformation. It offers a user interface (UI) where you can arrange different tools to execute specific tasks in a predefined order.
Alteryx is also no-coding user interface tool for easy and quick data transformations. Unlike ADF, it is not integrated into a broader platform.
Let's compare these two no-coding ETL tools.
I will not be evaluating the cost of each tool, as this can be complex and highly individualized.
1. Transformations and Data Manipulation
Alteryx offers a wide range of possibilities in this regard. In MDF, there are only 25 possible transformation steps, which is quite limited compared to Alteryx's extensive library of 300+ tools. However, the number of tools alone doesn't necessarily determine the capability of an ETL tool. This alone wouldn’t mean too much, but ADF is also not too versitile, each individual transformation step is very limited to a distinct task.
Let’s take a look at a few examples:
First Row of a Dataset:
Extracting the first row of a dataset is a common task, especially when dealing with unclean data or when you want to identify the best or worst records.
- You need to take the rank and then filter afterward by the rank. In Alteryx this would look similar (Sort and Sample Tool).
Manual data input:
Manual data input is a practical feature when you need to quickly add information to a flow, as it saves you a lot of time when creating lookup tables, when mapping data, etc.
- In MDF this is not possible by default. Her you would need to generate a new data file (e.g. a .csv), upload it to a Azure storage, generate a dataset and then finally input it into the Data Flow in ADF.
- In Alteryx, you can use the Text Input tool, which allows you to input text data directly.
Coding sections
- In Alteryx, tools like Python or R allow users to write custom code within their workflows. However, it's worth noting that executing code within Alteryx can often be slower compared to running the same code outside of the platform, and the package installation can be quite painful.
- In ADF adding Code in different languages is simple. Also, various services can be directly integrated, such as Databricks, HDInsight, Azure Machine Learning, Power Query, etc..This allows much more possiblities to use different recourses that fit the users needs and skills best.
Overall Complexcity
- Mapping Data Flows and its user interface are more user-friendly than SQL or similar languages, especially for larger transformations. However, these tools may not always be self-explanatory and might require some additional thought and online research when used.
- Alteryx's user interface simplifies tasks, but it is also not always self-explanatory, particularly as workflows grow in complexity. In such cases, the user interface becomes increasingly valuable.
2. Integration and Extensibility
- Azure Data Factory and its Mapping Data Flows are just one part of the broader Azure environment. As a result, they integrate seamlessly with other Azure services, including various storage services like SQL Database, Data Lake, and Blob Storage, as well as data analysis services such as Azure ML and PowerBI. Additionally, Azure offers a range of DevOps services that provide diverse possibilities. All these services are fully cloud-based.
- Alteryx, on the other hand, is a standalone tool. While there is the Alteryx Gallery and Cloud that offer some capabilities, data storage, or visualization services need to be managed outside of the Alteryx environment. This often requires additional setup time for establishing connections, opening ports, and configuring individual programs. However, it also provides greater flexibility.
3. Performance
Comparing Performance of MDF vs. Alteryx is not that simple. Azure, as a cloud based service allows to scale performance easily while increasing its cost. In the following the cheapest Azure Data Factory Configuration is used (Azure Integration Runtime, Compute type: General purpose, 4+4 cores) .
💡 Azure Data Factory itself does not execute its pipelines. It instead uses an Integration Runtime (IR) which executes the pipeline. ADF's primary role is to orchestrate the IR and provide the user interface
To compare the performance, a very simple Alteryx Weekly Challenge (Challenge 281 (https://community.alteryx.com/t5/Weekly-Challenge/Challenge-281-Holidays/td-p/801909)) was executed - a simple task to find out which country has the most holidays and which one has the most unique holidays. The input dataset for this task was small, containing just a few hundred rows and five columns.
- In Azure Data Factory (ADF), the data was sourced from an Azure Blob Storage location where the Excel file had been uploaded. The following Mapping Data Flow (MDF) was executed to complete the task.
After the pipeline run in ADF was completed, a detailed analysis was performed, focusing on the time taken for each step. Even for this relatively simple task, which involved a dataset with fewer than 1000 rows, the execution time was substantial, with small steps taking multiple seconds each.
The total run duration for this task was 4 minutes and 6 seconds. The input step alone took 47 seconds (coming from an Azure Blob Storage in the same enviroment).
Alteryx
In Alteryx, the execution of this task is significantly faster, even with modest computing power. The Alteryx execution completed in less than 1 second.
If we exclude the Input and Output steps in Azure Data Factory (ADF), taking into account different conditions compared to Alteryx, the execution time still stands at 176 seconds, which is notably long for such a small task. It's important to note that both executions were carried out with a total of 8 cores, although the Alteryx computer likely had substantially higher RAM.
Even if we were to substantially increase the resources allocated to ADF, the performance would still lag significantly behind, and the associated costs would increase considerably.
The cost of executing such a workflow in Azure is (as of September 23) 0.248€ per hour and core (https://azure.microsoft.com/en-us/pricing/details/data-factory/data-pipeline/). Once the integration runtime is started, it runs for at least one hour and the minimum configuration is 8 cores, meaning costs for this flow are about ~ 2€.
This can’t be directly compared to the cost of Alteryx since it comes with yearly license costs and additional computing costs when executing it on your own hardware.
Conclusion
In summary, when it comes to no-coding ETL processes, Alteryx outperforms Azure Data Factory. While ADF offers extensive capabilities for code-based transformations, its no-coding options are limited. Mapping Data Flows in ADF is suitable when coding skills are limited and ADF is already in use. Alteryx outperforms in no-code scenarios, making it an excellent choice for those looking to build ETL pipelines rapidly, especially if they lack coding expertise.