ETL Big Data Refactoring
The customer
Our client is a major data company that has operated in the Italian marketing sector for 25 years. Its core business is supporting companies in the targeted, high-quality acquisition of customers and partners.
The state of the art
The customer's existing ETL (Extract, Transform and Load) procedures stored data on relational databases fed incrementally from data sources of various kinds, updating the existing data with new information. These procedures were built on outdated technologies, and their obsolescence caused significant problems, specifically:
- slowness of the existing ETL procedures, developed with technologies that are difficult to scale and parallelize;
- poor maintainability of the code, due to technologies that have since fallen into disuse;
- high costs and difficulties in monitoring the various steps of the procedures;
- bugs that generated anomalies and inconsistencies in the data and that were difficult to fix.

The challenge
The customer, supported by our consultants, therefore decided to adopt Big Data technologies and rewrite the procedures in order to:
- speed up execution while resolving the anomalies introduced by the procedures in use;
- apply transformations and integrate multiple data sources at each step of the computation, reusing the existing business logic;
- schedule and monitor the execution of the pipelines;
- analyze the data with the software already in use, eliminating training costs for the staff involved.
The solution
To meet these requirements, the following technologies were adopted:
- Python and Apache Spark to develop and run the data pipelines, rewriting the integration, matching and transformation logic and exploiting the parallelization and scalability offered by the framework (see the sketch after this list);
- Amazon S3 and HDFS for distributed and replicated storage of data during all phases of computation (input, intermediate results and output);
- Cloud infrastructure for processing: since the pipelines run at regular intervals and for a limited time, an on-premise infrastructure is not required, and one would involve much higher management costs (e.g. maintenance, unavailability or idle resources, potentially variable SLA times, obsolescence);
- Apache Airflow for cloud infrastructure provisioning, scheduling and pipeline monitoring.
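As an illustration of the Spark-based approach, here is a minimal PySpark sketch of such a pipeline. It is only a sketch under assumed names: the S3 buckets, column names and matching rule are hypothetical stand-ins for the customer's actual integration and matching logic.

```python
# Minimal PySpark sketch of a rewritten pipeline: read raw CSV records from
# S3, normalize and deduplicate them, write the result back to S3 as Parquet.
# Bucket paths, columns and the matching rule are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("etl-refactor-sketch")
    .getOrCreate()
)

# Input files are assumed to land on S3 as CSV with a header row.
raw = spark.read.csv("s3a://example-input-bucket/contacts/", header=True)

# Transformation step: normalize the fields used for matching.
cleaned = (
    raw
    .withColumn("email", F.lower(F.trim(F.col("email"))))
    .withColumn("full_name", F.initcap(F.trim(F.col("full_name"))))
    .filter(F.col("email").isNotNull())
    .withColumn("ingestion_date", F.current_date())
)

# Matching step (simplified): keep one record per e-mail address; the real
# matching and integration logic is more elaborate.
deduplicated = cleaned.dropDuplicates(["email"])

# Output goes back to S3 as Parquet, partitioned by ingestion date.
(
    deduplicated.write
    .mode("overwrite")
    .partitionBy("ingestion_date")
    .parquet("s3a://example-output-bucket/contacts_clean/")
)

spark.stop()
```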
The benefits
By applying these technologies, the customer achieved significant gains in execution speed, data quality and infrastructure scalability, along with significant cost reductions.
Increased speed:
- overall pipeline execution times have been reduced by 85%: steps previously performed in 3 hours now require less than 30 minutes;
- the developed pipelines operate on the entire data set, with a throughput of 1,000 rows/second; the previously used ETL procedures operated on a small subset of the data, with a throughput of 3 rows/second;
- the data are acquired in a few seconds: uploading them to S3 is enough to make them immediately available for computation (illustrated in the sketch below).
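A minimal sketch of this ingestion path, with a hypothetical bucket, prefix and file name: uploading a file to S3 (here via boto3) is enough for the next Spark run to pick it up, with no separate loading step.

```python
# Minimal sketch: a file uploaded to S3 is immediately readable by the
# Spark pipelines. Bucket, key and file names are illustrative only.
import boto3
from pyspark.sql import SparkSession

# Upload today's extract to the input prefix used by the pipeline.
s3 = boto3.client("s3")
s3.upload_file("daily_contacts.csv", "example-input-bucket",
               "contacts/daily_contacts.csv")

# The same prefix can be read right away by a Spark job.
spark = SparkSession.builder.appName("ingest-check-sketch").getOrCreate()
df = spark.read.csv("s3a://example-input-bucket/contacts/", header=True)
print(df.count())  # rows available for computation as soon as the upload completes
spark.stop()
```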
Lower costs:
- the cloud solution, billed in “pay-as-you-go” mode, lets the computing power of the cluster be used only when it is needed and deactivated at the end of the computation, avoiding unnecessary costs;
- it was not necessary to train the staff who analyze the data, since storage also takes place, in parallel, on the pre-existing relational databases; in line with the customer’s requirements, the analysis software remained the same as before (see the sketch below).
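To illustrate the dual storage mentioned in the last point, here is a minimal sketch in which the pipeline output is written both to S3 and, via JDBC, to the pre-existing relational database. The connection string, credentials and table name are hypothetical, and the case study does not name the database engine; PostgreSQL is used here purely as an example.

```python
# Minimal dual-write sketch: the same result is stored on S3 (for the Spark
# pipelines) and on the pre-existing relational database (for the analysis
# software already in use). Connection details and names are illustrative only.
import os

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dual-write-sketch").getOrCreate()

result = spark.read.parquet("s3a://example-output-bucket/contacts_clean/")

# Copy 1: columnar storage on S3 for downstream Spark jobs.
result.write.mode("overwrite").parquet("s3a://example-archive-bucket/contacts_clean/")

# Copy 2: the existing relational database, via JDBC, so analysts keep
# querying the same tables with the same tools as before.
(
    result.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://db.example.internal:5432/marketing")
    .option("dbtable", "contacts_clean")
    .option("user", "etl_user")
    .option("password", os.environ["DB_PASSWORD"])
    .mode("overwrite")
    .save()
)

spark.stop()
```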
Improved quality and scalability:
- the data no longer show the anomalies previously encountered, and the quality of the information provided to customers has improved;
- it is now possible to monitor the various steps of the pipeline and identify any bottlenecks (see the DAG sketch below);
- thanks to the flexibility of cloud solutions, it would be easy to scale and optimize the infrastructure used, should it prove necessary to increase resources.
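Finally, as an example of the scheduling and monitoring set up with Apache Airflow, here is a minimal DAG sketch. The task names, schedule and spark-submit commands are hypothetical, and the cluster provisioning step mentioned earlier is omitted for brevity; each task's status and logs become visible in the Airflow UI, which is what makes the bottleneck analysis above possible.

```python
# Minimal Apache Airflow DAG sketch: schedules the Spark pipeline at a fixed
# interval and exposes each step in the Airflow UI for monitoring.
# Task names, schedule and commands are illustrative only.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="customer_etl_pipeline_sketch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",   # assumed nightly run at 02:00
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:

    # Each step becomes a task whose status and logs are visible in the UI.
    ingest = BashOperator(
        task_id="ingest_from_s3",
        bash_command="spark-submit jobs/ingest.py",
    )
    transform = BashOperator(
        task_id="transform_and_match",
        bash_command="spark-submit jobs/transform.py",
    )
    publish = BashOperator(
        task_id="publish_to_s3_and_rdbms",
        bash_command="spark-submit jobs/publish.py",
    )

    ingest >> transform >> publish
```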