All data lakes and warehouses have technical challenges. In this post, I will cover some of them.
This is the blog series:
Data Lakes and Warehouses Part 1: Intro to Paradigms
Data Lakes and Warehouses Part 2: Databricks and Snowflake
Data Lakes and Warehouses Part 3: Azure Synapse Point of View
With that, let's move on to the challenges.
They can be divided into three distinct categories:
Let us tackle the first two in this blog post. I will use Databricks, Snowflake, and Azure Synapse components as examples. However, it is important to note that neither the challenges nor the categories are limited to these tools.
I will try to find some solutions for the challenges in future posts.
I need to highlight that even if the discussion is generalizable, most of my and my colleagues’ experience relates to the Microsoft Azure cloud environment.
In addition, we work on DevOps and continuous delivery type projects. We always use common version control, such as Git, and code review practices. We also have multiple identical infrastructure environments, typically configured with infrastructure-as-code tools such as Terraform. We target continuous deployments to production.
Consequently, graphical user interface tools do not fit this approach very well. At a minimum, we should be able to export the configuration to a readable format such as JSON. Azure Data Factory, for instance, supports this.
The first category covers challenges related to third-party solutions. Of the technologies introduced earlier, this applies equally to Databricks and Snowflake, and a wide range of other tools face the same questions.
The Synapse framework, on the other hand, is developed by Microsoft, the Azure cloud provider itself. Its position in this category is thus stronger, even if some questions remain nontrivial.
First, security. We usually minimize data movement outside the customer’s own cloud environment and avoid taking data from the cloud provider’s network to the public internet. Cloud providers offer tools such as virtual networks and firewalls for this.
Very often the third-party tools run in a separate Azure environment: either in their own account or at least in a separate network architecture. The network configuration is usually more limited than with native solutions, so we might be able to configure secure access for web browser usage but not for the background engine. Sometimes third-party tools even need to be reachable from the public internet. This was originally the case for both example technologies, but things are much better nowadays.
A related question is permissions. In Azure native services, we have a standardized way of working: we can use Azure Active Directory (AAD) users, groups, and role-based access control (RBAC). We can usually synchronize our groups to third-party tools, but the permission model is typically different and must be configured separately.
As explained in the previous posts, the problem might not be as serious for data lake tools due to their simpler permission model. However, data warehouse tools have significantly more granular permission control.
Associated questions relate to getting logs, error messages, and alerts out of these tools. We usually need some additional configuration for this.
In a typical data lake scenario, the major component is based on Apache Spark, a distributed computation framework. This is the case for tools such as Databricks and the Synapse Spark pool. Let us make it clear: Apache Spark is not a simple tool to use.
The following example is illustrative. Assume that we have an optimized terabyte-scale data set <data 1>, which contains names, ages, and countries of residence.
A typical simple Spark job could include the following phases:
1. Read the data set <data 1> from storage.
2. Print the number of rows.
3. Sort the rows by age.
4. Filter to people living in Europe.
5. Take the ten oldest people and print them.
The code looks harmless. We might assume that lines 3 and 4, sorting and filtering, take the most resources to compute. Instead, line 2, printing the number of rows, might blow up in our face: it may exhaust memory or simply freeze the computation.
The reason is that Spark is a lazy framework and postpones processing as long as possible. Spark might notice that only the ten oldest European people are needed on line 5. Thus, it is enough to find those persons and skip most of the sorting and the storage reads. Printing the row count on line 2, however, might force reading all the data from storage.
On the other hand, if the data in storage were distributed differently, we could end up in the opposite situation: counting rows is immediate while everything else is slow. We need to understand both the Spark framework and the way the stored data is organized.
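Spark’s optimizer is far more sophisticated than this, but the core principle, that the cost of a step depends on how much data the whole pipeline forces to be read, can be sketched in plain Python with a lazy generator pipeline. The `Storage` class, row layout, and counts below are hypothetical stand-ins, not Spark itself:

```python
import itertools

class Storage:
    """Hypothetical row source that records how many rows it actually reads."""
    def __init__(self, n):
        self.n = n
        self.rows_read = 0

    def rows(self):
        for i in range(self.n):
            self.rows_read += 1
            yield {"name": f"person{i}", "age": i % 100,
                   "country": "DE" if i % 2 == 0 else "US"}

# Lazy pipeline: taking ten European rows stops reading as soon as
# ten matches are found - only 19 of the 100,000 source rows get read.
storage = Storage(100_000)
europeans = (r for r in storage.rows() if r["country"] == "DE")
ten = list(itertools.islice(europeans, 10))
print(storage.rows_read)   # 19

# Counting rows gives the pipeline no chance to stop early: full scan.
storage2 = Storage(100_000)
total = sum(1 for _ in storage2.rows())
print(total)               # 100000
```

The seemingly cheap count is the expensive step, just as in the Spark example above.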
In data warehousing, data is almost exclusively processed with the SQL language. Its suitability for the scenario is thus a critical question. This covers not only data warehouse products like Snowflake and the Synapse Dedicated SQL pool, but also, to some extent, SQL-based data lake tools like the Synapse Serverless SQL pool.
SQL is a declarative query language, whereas the other typically used languages are either imperative or functional. Illustrating why this matters needs a bit of effort, so bear with me.
With functional and imperative languages, a typical data processing workflow is:
1. Read the source data <file 1>.
2. Modify the data with the processing code.
3. Write the result to a new file <file 2>.
Assume we notice a mistake in the modification code on line 2. We fix the code, re-run the workflow, and replace the data <file 2> with the repaired version. For future development, we store the source data <file 1> and the latest code version.
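A minimal sketch of this workflow, with hypothetical file names and a hypothetical transformation standing in for <file 1>, <file 2>, and the modification code:

```python
import json
import pathlib
import tempfile

def modify(rows):
    # Hypothetical modification code. If this has a bug, the repair is
    # simple: fix the function and re-run it against the untouched source.
    return [{**row, "age": row["age"] + 1} for row in rows]

with tempfile.TemporaryDirectory() as d:
    file1 = pathlib.Path(d) / "file1.json"          # stand-in for <file 1>
    file1.write_text(json.dumps([{"name": "Alice", "age": 30}]))

    rows = json.loads(file1.read_text())            # 1. read source data
    new_rows = modify(rows)                         # 2. modify the data
    file2 = pathlib.Path(d) / "file2.json"          # stand-in for <file 2>
    file2.write_text(json.dumps(new_rows))          # 3. write a new file

    result_age = json.loads(file2.read_text())[0]["age"]

print(result_age)  # 31 - and file1 still holds the original data
```

The key property is that the source data is never mutated, so any step can be re-derived at will.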
SQL, in contrast, processes data by inserting new rows, altering values, or deleting data in place. Thus, a typical SQL workflow is:
1. Run SQL code that inserts into, updates, or deletes from <data table 1> in place.
Again, let us repair a hypothetical bug on line 1. We need to revert the changes made to <data table 1> and execute the repaired code against it. In practice, this means re-reading the original source data and re-executing the full processing history. We hope the processing history is stored somewhere. We also hope nobody bypassed the process with manual changes – most likely a false hope.
To make things even worse, we typically have multiple modules that depend on each other. For example, we might store data warehouse modification code, data mart code, and permission management code in different repositories. The interdependency ordering is rarely stored, and we need to painfully re-execute everything manually in a guessed order.
The same situation appears if we want to create one more identical infrastructure environment for, say, testing purposes.
To be precise, we could use the functional/imperative-style workflow in SQL development by, for example, keeping multiple table versions. This is cumbersome and rarely done. Databases also support non-SQL code via stored procedures, but these are not meant to replace SQL, only to add functionality on top of it.
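The contrast can be sketched with Python’s sqlite3 module as a stand-in for a warehouse. The table names and the deliberately buggy update below are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source_data (name TEXT, age INTEGER)")
con.executemany("INSERT INTO source_data VALUES (?, ?)",
                [("Alice", 30), ("Bob", 40)])

# In-place SQL workflow: load once, then keep mutating the same table.
con.execute("CREATE TABLE warehouse AS SELECT * FROM source_data")
con.execute("UPDATE warehouse SET age = age + 2")   # buggy: should be + 1
# The pre-update ages are gone from 'warehouse'. Fixing the bug means
# rebuilding the table from the source and replaying every later step.

# Functional-style workflow: each step derives a new table; sources stay intact.
con.execute("CREATE TABLE warehouse_v2 AS "
            "SELECT name, age + 1 AS age FROM source_data")
# Repair = drop warehouse_v2 and re-derive it; no history replay needed.

ages_bad = [r[0] for r in con.execute("SELECT age FROM warehouse ORDER BY name")]
ages_v2 = [r[0] for r in con.execute("SELECT age FROM warehouse_v2 ORDER BY name")]
print(ages_bad)  # [32, 42] - corrupted, original values unrecoverable
print(ages_v2)   # [31, 41] - derived fresh from the untouched source
```

This is exactly the multiple-table-versions workaround mentioned above, and it also shows why it is rarely done: every derived table must be managed and re-pointed manually.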
We have gone through some major challenges that are common for third-party cloud tools, data lake solutions and data warehouse products in a continuously deployed development environment.
In the Azure cloud, third-party tools such as Databricks and Snowflake typically face challenges with configuring, for instance, network infrastructure, permissions, log collection, and alerts. The Azure Synapse framework, on the other hand, is strong in this area.
Many data lake paradigm solutions are at least partially based on Apache Spark, the dominant data lake processing engine. As explained, it is a challenging framework: efficient optimization requires a deep understanding of both the framework itself and the way the data is stored.
Data warehouses, on the other hand, are almost exclusively based on the SQL query language. The SQL data modification workflow is problematic from a continuous deployment point of view: it centers on altering data in place, which makes processing lineage hard to track. Repairing bugs and creating new environments takes significantly more effort than with alternative approaches.
In future posts we will hopefully get some ideas on how to solve the issues. Moreover, we might be able to dig a bit deeper into the more product specific questions.
Timo is a cloud big data expert (PhD) with over a decade of experience in modern data solutions. He enjoys trying out new technologies and is particularly interested in technologies of storing, organizing and querying data efficiently in cloud environments. He has worked in big data roles both as a consultant and in-house.