A Data Engineer’s Highlights of PyCon Berlin 2023

13.7.2023
Mariia Snihyr

Introduction
Pandas 2.0 and beyond
Large Scale Feature Engineering and Data Science with Python & Snowflake
An opinionated introduction to Polars
Common issues with Time Series data and how to solve them
WALD: A Modern & Sustainable Analytics Stack
Towards Learned Database Systems
Rusty Python: A Case Study
The search for meaningful test data
Creating Synthetic Data for Open Access
Most of you don’t need Spark. Large-scale data management on a budget with Python
Apache Arrow: connecting and accelerating dataframe libraries across the PyData ecosystem
Postmodern Architecture – The Python Powered Modern Data Stack

I am Mariia, a Data Engineer in the Data Product team at FELD M. In April 2023, my colleague and I visited Berlin to attend the famous PyCon – the largest European convention for the discussion and promotion of the Python programming language.

Every year it gathers Python users and enthusiasts from all over the world and gives them a platform to share information about new developments, exchange knowledge, and learn best practices from each other.

In 2023, PyCon Berlin was merged with PyData, a forum for users and developers of data analysis tools. It lasted for three days and included so many presentations that it would take a team of at least seven people to attend all of them.

Fortunately, the sessions were recorded, and now, after some months, they are available for everyone. You will find a link to the YouTube playlist of PyCon Berlin 2023 talks at the end of this article.

But first, I would like to offer you my own overview of the presentations that we attended and liked the most. Please remember that this overview is based on personal opinion, so it may be biased and different from yours. Feel free to add your perspective in the comments!

1. Pandas 2.0 and beyond

For whom: Software and Data Engineers, Data Scientists, and everyone who works with Pandas (except animal keepers in public zoos, maybe)
Why it’s worth watching: The talk not only covers the changes that were implemented in Pandas 2.0 in comparison with Pandas 1.0, but also touches on PyArrow, which is actively used in the latest version of Pandas. (If you are curious about what PyArrow is, the talk about it is #11 in this list.)
Our verdict: Interesting topic, very relevant for our work. Rating: 9/10
More details can be found here
View a video of the talk on YouTube

2. Large Scale Feature Engineering and Data Science with Python & Snowflake

For whom: Data Scientists, Data Engineers, and those who are interested in Snowflake
Why it’s worth watching: This talk was essentially an introduction to Snowpark, Snowflake’s framework for machine learning development that can work with big data in Python, Scala, or Java.
Our verdict: Good presentation, but you won’t get much out of it if you don’t work with Snowflake on a regular basis. Rating: 7/10
More details can be found here
View a video of the talk on YouTube

3. Raised by Pandas, striving for more: An opinionated introduction to Polars

For whom: Software and Data Engineers, Data Scientists, and everyone who works with Pandas (but is striving for more)
Why it’s worth watching: The talk gives a really good overview of Polars and inspires you to test it as a more powerful alternative to Pandas.
Our verdict: The speaker was passionate about the library and very engaging. The slides were great fun! Above all, Polars is quite a hot topic at the moment, so definitely: Rating: 10/10
More details can be found here
View a video of the talk on YouTube

4. Common issues with Time Series data and how to solve them

For whom: mostly Data Scientists, but still relevant for anyone working with data
Why it’s worth watching: This talk walks you through four common issues with Time Series data and gives you hints on how to resolve them.
Our verdict: The presentation was quite good, but covered relatively basic things, hence: Rating: 7/10
More details can be found here
View a video of the talk on YouTube

5. WALD: A Modern & Sustainable Analytics Stack

For whom: Data Engineers, BI specialists, and companies and teams who aim to become more data-driven
Why it’s worth watching: The presentation was dedicated to the tools you can use for building a modern reporting pipeline, and WALD, a solution in which these tools are already combined.
Our verdict: We were really curious to check out which technologies our colleagues from other companies use for building reporting pipelines. Also, I have to admit, the slides were very cool! Rating: 8/10
More details can be found here
View a video of the talk on YouTube

If you are looking for a ready-to-use solution that would help you extract more value from your data, check out the development of our Data Product team: Datacroft Analytics Stack - contact us for more details!

6. Towards Learned Database Systems

For whom: Anyone working with databases
Why it’s worth watching: The talk presents a new direction in database research, so-called learned Database Management Systems (DBMS), where core components of the DBMS are replaced by machine learning models. According to the speaker, this approach has shown significant performance benefits.
Our verdict: The topic is exciting per se, but kudos to the speaker – he made it even better with his excellent and well-balanced presentation! Rating: 10/10
More details can be found here
View a video of the talk on YouTube
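
To make the idea a bit more tangible, here is a minimal Python sketch of one frequently cited learned component, a learned index. It is my own illustration and not code from the talk: the class name `LearnedIndex` and its `lookup` method are hypothetical, and a simple linear regression stands in for the far more elaborate models used in real systems. The model predicts where a key sits in a sorted array, and a short binary search within the model's known maximum error corrects the guess.

```python
import bisect


class LearnedIndex:
    """Toy learned index over a sorted list of numeric keys (illustrative only)."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        if n == 0:
            raise ValueError("need at least one key")

        # Fit position ~ slope * key + intercept with ordinary least squares.
        mean_key = sum(self.keys) / n
        mean_pos = (n - 1) / 2
        var_key = sum((k - mean_key) ** 2 for k in self.keys)
        cov = sum((k - mean_key) * (p - mean_pos) for p, k in enumerate(self.keys))
        self.slope = cov / var_key if var_key else 0.0
        self.intercept = mean_pos - self.slope * mean_key

        # Worst prediction error on the stored keys bounds the search window.
        self.max_err = max(abs(self._predict(k) - p) for p, k in enumerate(self.keys))

    def _predict(self, key):
        return round(self.slope * key + self.intercept)

    def lookup(self, key):
        """Return the position of the first occurrence of key, or None if absent."""
        n = len(self.keys)
        guess = self._predict(key)
        lo = min(max(0, guess - self.max_err), n)
        hi = max(lo, min(n, guess + self.max_err + 1))
        # Binary search only inside the small window around the model's guess.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < n and self.keys[i] == key:
            return i
        return None


index = LearnedIndex([10, 20, 30, 40, 50])
print(index.lookup(30))  # position of key 30 in the sorted keys
print(index.lookup(35))  # None, because 35 is not stored
```

The better the model fits the key distribution, the smaller the search window becomes, which is where the performance benefit of such a component would come from.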

7. Rusty Python: A Case Study

For whom: Software and Data Engineers working with Python
Why it’s worth watching: An overview of Rust and its benefits for Python developers. Exciting presentation about implementing a solution in Rust and integrating it with a Python application using PyO3.
Our verdict: Very interesting topic and excellent presentation. Rating: 10/10
More details can be found here
View a video of the talk on YouTube

8. "Lorem ipsum dolor sit amet"

For whom: Everyone working with software and data
Why it’s worth watching: The talk, with its tongue-in-cheek title, is dedicated to the process of finding meaningful test data for your software. The importance of this topic can’t be overstated, so those who work with data on a regular basis should definitely check it out.
Our verdict: Fun slides, but I’ve got a feeling that the main message was a bit diluted by the number of jokes and examples. Still, it was a useful and engaging session. Rating: 8/10
More details can be found here
View a video of the talk on YouTube

9. Unlocking Information – Creating Synthetic Data for Open Access

For whom: Data Scientists, but might be interesting to anyone working with data
Why it’s worth watching: If you’ve ever wondered how to make the data you used in your work public without disclosing any personal information, this presentation might be exactly what you are looking for.
Our verdict: The topic is a bit niche, though still good for general professional development. Rating: 7/10
More details can be found here
View a video of the talk on YouTube

10. Most of you don’t need Spark. Large-scale data management on a budget with Python

For whom: Software and Data Engineers, Data Scientists
Why it’s worth watching: The talk covered a lot of aspects and technologies that can help you manage large volumes of data and build scalable infrastructure for its processing.
Our verdict: The speaker asks some questions that might make you feel a bit dumb and trigger an episode of impostor syndrome, but besides that the talk was great! Rating: 9/10
More details can be found here
View a video of the talk on YouTube

11. Apache Arrow: connecting and accelerating dataframe libraries across the PyData ecosystem

For whom: Software and Data Engineers, Data Scientists
Why it’s worth watching: If you have heard about PyArrow or Apache Arrow before (e.g., while watching the “Pandas 2.0 and beyond” talk) and you want to dive deeper and find out more about this technology, this presentation is for you. If you haven’t heard of PyArrow before, this presentation is even more perfect for you.
Our verdict: Arrow is fantastic, but the talk was not too light-hearted, so it requires some concentration. Rating: 8/10
More details can be found here
View a video of the talk on YouTube

12. Postmodern Architecture – The Python Powered Modern Data Stack

For whom: Data Engineers, BI specialists, companies, and teams who aim to become more data-driven
Why it’s worth watching: The speaker and his team basically built a competitor to WALD (check #5 in the list). They offer it as a set of technologies forming a flexible stack that can deal with integrating data and extracting value from it.
Our verdict: Again, if you are curious about technologies that can be used for building a modern reporting pipeline, you should watch it. And as a fan of Brooklyn 99, I can’t help but admire the slides. Rating: 8/10
More details can be found here
View a video of the talk on YouTube

As already mentioned above, there were many more exciting presentations at PyCon Berlin 2023. You can find the full list of sessions with descriptions on the conference schedule page. And, fortunately, the majority of the recordings are now available to everyone on YouTube!

To wrap it up, I can say that PyCon is a great event for everyone who is passionate about programming, data, and, of course, Python. It inspires you to try new things and re-think your approaches, brings you closer to your fellow developer community, and gives you the joy of learning from the best experts in your field.

And of course, it’s a perfect reason to visit the vibrant city of Berlin and enjoy its amazing local food, nightlife scene, rich history and some of the most remarkable sights! We are looking forward to PyCon 2024, and hope that after this article you are too!

If you are interested in our work within the Data Product Team, you can find more information here.

We also showcase some of our data engineering & architecture projects here.
