Wednesday, June 10, 2020

Don’t use SQL for your Big Data Pipelines

The time has come to code data pipelines in high-level programming languages

ETL pipelines have been written in SQL for decades, and that has worked very well (at least in most cases) for many well-known reasons.
In the era of Big Data, engineers and companies rushed to adopt new processing tools for their ETL/ELT pipelines, such as Spark, Beam and Flink, and started writing code instead of SQL procedures for extracting, loading and transforming their huge (or not so huge!) amounts of data.

The Age of Testing Your ETL/ELT Pipelines

Since ETL/ELT pipelines can now be implemented as (a sequence of) jobs written in high-level programming languages like Scala, Java or Python, all your processing steps are expressed as pieces of code that can be structured, designed and, above all, fully covered with automated tests, so you can build robust development and deployment pipelines. Data engineers can now work in an environment that includes common tools like git, pull requests, automated tests, builds and even deployments, saying goodbye to visual tools embedding huge SQL queries that are almost untestable, unreadable, unmaintainable and hated by every developer (anyone involved in traditional DWH development can relate).
The most underrated benefit of this new way of doing ETL/ELT is that automated testing is now a real thing for data pipelines exactly as it is for any other piece of software around the world.
For the sake of clarity, let's look at an example with a piece of pseudo-code (similar to data processing frameworks like Apache Spark) representing a simple pipeline:
FooCollection inputRecords = // extract raw data from source
BarCollection enrichmentRecords = // extract data from external table

OutCollection outputRecords = inputRecords
  .cleanAndNormalize
  .enrichWithBar(enrichmentRecords)
  .toOutCollection

outputRecords.writeTo(destination)
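To make that a bit more concrete, here is a minimal sketch of how the same structure might look with Spark Datasets in Scala. Everything in it is hypothetical and only mirrors the pseudo-code above: the record types (Foo, Bar, Out), the transformation functions and the parquet paths are placeholders, not a prescribed implementation.

import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record types mirroring the pseudo-code above
case class Foo(id: Long, rawValue: String)
case class Bar(id: Long, category: String)
case class Out(id: Long, value: String, category: String)

object FooPipeline {
  // Each step is a plain, named function over a Dataset, so it can be tested in isolation
  def cleanAndNormalize(records: Dataset[Foo]): Dataset[Foo] = {
    import records.sparkSession.implicits._
    records
      .filter(_.rawValue != null)
      .map(f => f.copy(rawValue = f.rawValue.trim.toLowerCase))
  }

  def enrichWithBar(foos: Dataset[Foo], bars: Dataset[Bar]): Dataset[Out] = {
    import foos.sparkSession.implicits._
    foos
      .joinWith(bars, foos("id") === bars("id"))
      .map { case (f, b) => Out(f.id, f.rawValue, b.category) }
  }

  def run(spark: SparkSession, source: String, enrichment: String, destination: String): Unit = {
    import spark.implicits._
    val inputRecords      = spark.read.parquet(source).as[Foo]      // extract raw data from source
    val enrichmentRecords = spark.read.parquet(enrichment).as[Bar]  // extract data from external table

    val outputRecords = enrichWithBar(cleanAndNormalize(inputRecords), enrichmentRecords)
    outputRecords.write.parquet(destination)
  }
}

The point is not the specific API calls but the shape: every transformation is an ordinary function with a name and a signature, which is exactly what makes the testing discussed below possible.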
Although the pipeline is very simple, real jobs aren't far from this structure, and one can immediately see that:
  • the code is easy to read and self-explanatory;
  • the code can be split into pieces that are easy to test and can be designed to be highly maintainable and well structured using classes, methods, functions, interfaces and all the common constructs a high-level programming language offers;
  • pipelines can be implemented with a TDD approach.
In brief, the whole project can be built around clean, tested and reusable code, making your data pipelines robust and easy to maintain and deploy.
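To give a feel for the TDD point above, here is a sketch of a unit test for a piece of cleaning logic factored out into a pure, Spark-free function. The Cleaning object and normalizeValue are hypothetical names used only for illustration; the test uses ScalaTest.

import org.scalatest.funsuite.AnyFunSuite

// Hypothetical pure function extracted from the cleaning step of a pipeline
object Cleaning {
  def normalizeValue(raw: String): Option[String] =
    Option(raw).map(_.trim.toLowerCase).filter(_.nonEmpty)
}

class CleaningSpec extends AnyFunSuite {
  test("normalizeValue trims and lowercases valid input") {
    assert(Cleaning.normalizeValue("  Some Value ") === Some("some value"))
  }

  test("normalizeValue rejects null and blank input") {
    assert(Cleaning.normalizeValue(null) === None)
    assert(Cleaning.normalizeValue("   ") === None)
  }
}

Because the logic lives in a plain function rather than inside a SQL expression, the test runs in milliseconds and can be written before the implementation.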

Dealing with the “full SQL” style

If you imagine writing the pipeline above in SQL, the result would be a very long query performing lots of CASE WHEN statements, string manipulations, joins and type conversions, all put together in a single unmaintainable job.
Even if things and responsibilities were wisely separated by splitting the job into multiple parts, SQL still wouldn't benefit from a robust automated testing framework. Moreover, things like type safety, compile-time errors, coding best practices, a test-driven approach and so on wouldn't be possible at all.
To be honest, some commercial ETL and data integration tools do offer ways to test your jobs (although they are rarely used, but that's another story). The problem is that they are proprietary, often paid and limited solutions that provide data and schema validation based on pre-built operators, and they vary a lot depending on the tool of choice.
Instead, performing unit and integration testing with a high-level language like Scala or Java is based on the same practices and rules as always, varies very little from framework to framework, and gives you unparalleled robustness and coverage.
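Sticking with the hypothetical FooPipeline sketched earlier, a test exercising the real Spark engine only needs a local SparkSession, a tiny in-memory Dataset and a plain assertion on the collected result. Again, this is just a sketch of the approach, not a prescribed setup.

import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class FooPipelineSpec extends AnyFunSuite {
  // A local SparkSession runs the real engine inside the test JVM, no cluster needed
  private val spark = SparkSession.builder()
    .master("local[1]")
    .appName("foo-pipeline-test")
    .getOrCreate()

  import spark.implicits._

  test("cleanAndNormalize trims, lowercases and drops null values") {
    val input  = Seq(Foo(1L, "  SOME Value "), Foo(2L, null)).toDS()
    val result = FooPipeline.cleanAndNormalize(input).collect().toList

    assert(result === List(Foo(1L, "some value")))
  }

  test("enrichWithBar joins records with their enrichment data") {
    val foos   = Seq(Foo(1L, "a")).toDS()
    val bars   = Seq(Bar(1L, "category-x")).toDS()
    val result = FooPipeline.enrichWithBar(foos, bars).collect().toList

    assert(result === List(Out(1L, "a", "category-x")))
  }
}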

The Scary Effect of Managing Code

While developers and data engineers got excited about the new trend (from the early Hadoop-centered solutions to the newer cloud-based and serverless data platforms), many companies didn't appreciate the amount of new technologies and code the Big Data ecosystem brought in, and soon started to suffer from (at least three) common problems:
  • lack of technical knowledge inside their BI teams;
  • shortage of skilled data engineers in the job market;
  • abundance of SQL ninjas who suddenly found themselves unable to write ETL/ELT pipelines for new big data projects.
Plus, since many Big Data technologies focused on providing “fast” SQL engines for querying and managing data stored on HDFS, filesystems and cloud blob storages, many companies started adopting tools like Hive, Drill, Impala, etc. not only for data exploration and analysis but as the core of their new ETL/ELT pipelines.
This architectural choice is mainly endorsed by companies that don't have (and don't want to hire) data engineers with the right skillset for these technologies and, instead of training their current employees, prefer to rely on their old-fashioned SQL experts to build and manage their brand new data lake.
Unfortunately, when I worked as a consultant, I saw more than one company implementing complex ETL/ELT pipelines with tools like Apache Hive, embedding SQL queries within data integration tools offering classical blocks-and-arrows GUIs and "big data connectors". Under these circumstances, even a solid and rigorous data model didn't prevent them from having an untested and unmaintainable system where every change is painful to develop and guaranteed to break something, somewhere (maybe in production).
These GUI-and-SQL, no-coding-skills-needed solutions are a trend that will bring more harm than good.
I've only covered the batch processing use case here, but with the growing need to implement streaming solutions, I feel these tools aren't mature enough to solve the newest challenges of data processing.

TL;DR: Is Big Data processing the end of the SQL era?

Obviously not. SQL is a powerful tool for data analysis and exploration even in the Big Data ecosystem, and it's fine if your ETL/ELT jobs use pushdown to execute simple queries against a SQL-based engine when reading data. Moreover, there's no one-size-fits-all solution: if all you have to do is (very) simple ETL and data integration, you can probably still rely on a few blocks executing trivial SQL queries against your engine of choice.
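As an example of that kind of pushdown, Spark's JDBC reader can delegate a simple filter or projection to the source database while the rest of the logic stays in code. The connection string, credentials and query below are placeholders, a sketch rather than a recommended configuration.

import org.apache.spark.sql.SparkSession

object PushdownExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pushdown-example").getOrCreate()

    // Simple filtering/projection is pushed down to the SQL engine at read time;
    // complex, testable business logic stays in Scala code downstream
    val completedOrders = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/sales")  // placeholder connection
      .option("dbtable", "(SELECT id, amount, country FROM orders WHERE status = 'COMPLETED') AS completed_orders")
      .option("user", "etl_user")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .load()

    completedOrders.show()
  }
}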
However, when it comes to complex data transformation and processing, it's a better option to use in-memory, distributed processing engines like Spark or Beam, writing your business logic in a high-level programming language within a rock-solid development ecosystem supported by the best practices of writing good code.
The result is a robust, scalable and maintainable technology stack that both the company and its engineers will benefit from.
Not having the right technical skills should not be the reason for not using the right tools.

