In mid-March 2019, we received a message from Christian Weber, the COO of Freeletics, via the contact form on our company website. A former colleague from Switzerland had recommended us because they needed support in redesigning and developing a solution in AWS to replace the existing data warehouse.
These days, I sometimes like to call myself an IT grandpa. For more than 25 years now, I have been building various things on the net. In the beginning, it was the Linux router at my family home, which provided the family with internet access over the LAN.
It helped me to understand the basic functionality of the internet and its protocols like TCP, UDP, or HTTP. Today I support our customers with my team in designing and implementing applications for the internet.
Helping the well-known startup Freeletics to replace their DWH and design a new, future-proof architecture immediately triggered a slight tingling sensation in me. The amount of data to be processed had to be immense!
Although we had already implemented projects on-premise and in the Azure cloud, including what are now often referred to as “big data” projects, we had only gained basic experience with AWS at that time. We therefore did not want to call ourselves experts in this area. But with enough respect for the task, we decided to take a closer look.
After a first meeting with Christian and his data engineering team, it soon became clear that we were going to work together.
The team understood what was important: an in-depth understanding of technologies and their interaction in general, as well as experience in designing and implementing applications.
To develop a new, future-proof system, we first had to understand the existing one. The team helped us gain insight into the system that had been supplying Freeletics with data up to that point:
Talend as ETL tool
Mainly batch-based data processing
EC2 Linux machines as ETL hosts
Redshift databases as DWH
Data sources and their target formats in the DWH were identified
Due to the rapid growth of Freeletics' customer base, currently more than 42 million users, and the resulting data volumes, this solution had reached its limits.
Based on the insights into the existing DWH and the predicted growth, the following requirements were determined:
Events are provided via HTTP
More than 100 events per second must be processed
Data processing almost in real-time
A central database will store Freeletics user data including nested attributes
Events should be enriched with user attributes from the database
The application should be scalable, reliable, expandable, and easy to maintain
AWS as a cloud provider
The operation of the infrastructure should be performed mostly independently by the Data Engineering Team
Data is to be stored in an S3 DataLake
Python as the primary programming language
PySpark for processing large amounts of data
The Data Analytics team should be able to perform more complex analyses with Spark by accessing raw data.
Teams from product development to finance should have the possibility to independently create generic but complex analyses and simple visualizations.
Due to the high read and write volumes and the large number of attributes, including nested attributes, we decided to use DynamoDB as the data store for Freeletics users.
The fast response time and scalability of DynamoDB were also decisive factors for this choice.
Data is received via an API Gateway and processed by Lambda functions.
Processing means enriching events with user attributes (stored in DynamoDB) as well as GDPR-compliant masking of sensitive data such as IP addresses, email addresses, etc.
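To make this concrete, here is a minimal Python sketch of such an enrichment Lambda. The table name `users`, the key `user_id`, the `attributes` map, and the `ip`/`email` fields are assumptions for illustration, not the actual Freeletics schema; masking is shown here as a one-way hash.

```python
# Minimal sketch of the enrichment Lambda (hypothetical table and field names).
import hashlib
import json

import boto3

dynamodb = boto3.resource("dynamodb")
users_table = dynamodb.Table("users")  # hypothetical table name


def mask(value: str) -> str:
    """Replace a sensitive value with a non-reversible hash (GDPR masking)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def handler(event, context):
    payload = json.loads(event["body"])  # event delivered via API Gateway

    # Enrich the event with user attributes stored in DynamoDB.
    item = users_table.get_item(Key={"user_id": payload["user_id"]}).get("Item", {})
    payload["user_attributes"] = item.get("attributes", {})

    # Mask sensitive fields such as IP and email addresses.
    for field in ("ip", "email"):
        if field in payload:
            payload[field] = mask(payload[field])

    # default=str keeps DynamoDB Decimal values serializable.
    return {"statusCode": 200, "body": json.dumps(payload, default=str)}
```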
A general problem with event-based processing is error handling. Something usually goes wrong, no matter how stable the systems are designed. 🙂
Whether an event is malformed, our software contains a bug, or a target system is temporarily unavailable: in that moment, the event cannot be processed successfully. AWS offers SQS dead-letter queues for exactly this kind of problem.
If an error occurs during processing, the affected events are moved to an SQS dead-letter queue. Monitoring the queues, combined with a reprocessing function, allows failed events to be processed again without losing them.
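The reprocessing can be as simple as draining the dead-letter queue back onto the main queue once the root cause has been fixed. The following is a minimal sketch with hypothetical queue URLs, not the actual setup:

```python
# Minimal re-drive sketch: move failed events from the DLQ back to the main queue.
import boto3

sqs = boto3.client("sqs")
DLQ_URL = "https://sqs.eu-central-1.amazonaws.com/123456789012/events-dlq"    # hypothetical
MAIN_QUEUE_URL = "https://sqs.eu-central-1.amazonaws.com/123456789012/events"  # hypothetical


def reprocess_dlq() -> None:
    """Re-submit failed events for another processing attempt."""
    while True:
        response = sqs.receive_message(
            QueueUrl=DLQ_URL, MaxNumberOfMessages=10, WaitTimeSeconds=2
        )
        messages = response.get("Messages", [])
        if not messages:
            break
        for message in messages:
            # Re-submit the original event body ...
            sqs.send_message(QueueUrl=MAIN_QUEUE_URL, MessageBody=message["Body"])
            # ... and remove it from the DLQ only after the re-send succeeded.
            sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=message["ReceiptHandle"])
```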
Events are stored in S3, partitioned by event type/YYYY/MM/DD; the writes are performed by the Lambda functions mentioned above.
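As a rough illustration of that layout, the write inside the Lambda might look like the following sketch; the bucket name and the `timestamp`/`event_type` fields are assumptions:

```python
# Minimal sketch of the partitioned S3 write (hypothetical bucket and fields).
import json
import uuid
from datetime import datetime

import boto3

s3 = boto3.client("s3")
BUCKET = "freeletics-datalake"  # hypothetical bucket name


def store_event(event: dict) -> None:
    """Write a single event under event_type/YYYY/MM/DD/<uuid>.json."""
    ts = datetime.fromisoformat(event["timestamp"])
    key = f"{event['event_type']}/{ts:%Y/%m/%d}/{uuid.uuid4()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(event).encode("utf-8"))
```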
A downstream Databricks job combines the individual events into Parquet files, which allows faster and more cost-effective access to the data.
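Such a compaction job could look roughly like the following PySpark sketch; the bucket names, prefixes, and the example event type are hypothetical:

```python
# Minimal PySpark compaction sketch: one day's single-event JSON objects -> few Parquet files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("event-compaction").getOrCreate()

EVENT_TYPE, DAY = "workout_completed", "2020/07/01"  # hypothetical values

events = spark.read.json(f"s3://freeletics-datalake/{EVENT_TYPE}/{DAY}/*.json")

(events
 .coalesce(8)  # fewer, larger files instead of one object per event
 .write
 .mode("overwrite")
 .parquet(f"s3://freeletics-datalake-parquet/{EVENT_TYPE}/{DAY}/"))
```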
Ad-spend data was loaded daily from an API via classic Python batch jobs; larger amounts of data, such as CRM data, were processed via Spark jobs. The resulting data was likewise written to the S3 DataLake.
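Such a daily batch job can be sketched as follows, with a hypothetical ad-platform endpoint, token placeholder, and bucket name; the `requests` library is assumed to be available:

```python
# Minimal daily ad-spend batch sketch (hypothetical endpoint, token, bucket).
import json
from datetime import date

import boto3
import requests

s3 = boto3.client("s3")
BUCKET = "freeletics-datalake"                      # hypothetical
AD_SPEND_URL = "https://api.example-ads.com/spend"  # hypothetical endpoint


def load_ad_spend(day: date) -> None:
    """Fetch one day's ad spend and store it in the DataLake."""
    response = requests.get(
        AD_SPEND_URL,
        params={"date": day.isoformat()},
        headers={"Authorization": "Bearer <token>"},  # placeholder credential
        timeout=30,
    )
    response.raise_for_status()
    key = f"ad_spend/{day:%Y/%m/%d}/spend.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(response.json()).encode("utf-8"))


if __name__ == "__main__":
    load_ad_spend(date.today())
```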
The individual events in the DataLake, as well as the events aggregated by Databricks, can be accessed with Spark, which allows for efficient analysis of large amounts of data.
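An ad-hoc analysis on top of the aggregated data might then look like this sketch, reusing the hypothetical bucket layout and `timestamp` field from the compaction example above:

```python
# Minimal analysis sketch: daily event counts from the aggregated Parquet data.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("event-analysis").getOrCreate()

events = spark.read.parquet("s3://freeletics-datalake-parquet/workout_completed/")

daily_counts = (events
                .groupBy(F.to_date("timestamp").alias("day"))
                .count()
                .orderBy("day"))
daily_counts.show()
```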
Freeletics chose the tool Amplitude to make data accessible to product managers, for example. For this purpose, events are forwarded to an “export” Lambda, which transfers them to Amplitude. Evaluations can thus be carried out almost in real time. Shared identifiers across the different event types allow complex evaluations for product development and controlling.
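A minimal sketch of such an export Lambda follows, assuming Amplitude's HTTP API v2 endpoint, a hypothetical `AMPLITUDE_API_KEY` environment variable, and assumed field names on the incoming event:

```python
# Minimal "export" Lambda sketch forwarding a processed event to Amplitude.
import json
import os
import urllib.request

AMPLITUDE_URL = "https://api2.amplitude.com/2/httpapi"  # assumed HTTP API v2 endpoint


def handler(event, context):
    # Map the already enriched and masked event onto Amplitude's event schema
    # (field names on the incoming event are assumptions).
    amplitude_event = {
        "user_id": event["user_id"],
        "event_type": event["event_type"],
        "event_properties": event.get("properties", {}),
    }
    body = json.dumps({
        "api_key": os.environ["AMPLITUDE_API_KEY"],
        "events": [amplitude_event],
    }).encode("utf-8")

    request = urllib.request.Request(
        AMPLITUDE_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return {"statusCode": response.status}
```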
We implemented the previously planned system together with Freeletics’ data engineering team.
After a development time of about 8 months, all required functionalities were implemented in the new system and made available to the users. The old DWH could be shut down.
Afterward, Freeletics' Data Engineering team completely took over the operation, maintenance, and further development of the system. FELD M continues to support the team to this day, July 2020.
As developers, we should not be put off or even intimidated by what one of my best friends likes to describe as “technology bombshells”. Extensive experience with and a deep understanding of technology allow us to develop good applications, be it in Azure, AWS, on-premise, or any other environment.
After all, our applications are still based on the good old protocols such as TCP, UDP, or HTTP.