Data Engineer Things Newsletter #23 (Sept 2025)

Open-Source Breakthroughs with Sail and MLflow, Hidden Infrastructure Behind ChatGPT, Streaming Data Into Iceberg, Data Quality Tips, and More.

Data Engineer Things, Shubham Gondane, and Volker Janz

Sep 09, 2025

Highlights of the DET Newsletter September 2025 Edition.

Hey folks,

Good to connect with you all! Our newsletter intros have rotated between teammates in recent editions, and this time I get to do the writing. I’m excited to welcome you to this issue.

I’m writing from the Bay Area as summer winds down. Having lived in Arizona, Texas, and now California, I’ve experienced heat in all its forms, from dry to humid to the famous Bay Area microclimates. And much like climate, data engineering is all about navigating different conditions and making sure systems stay reliable.

That’s why the value of data engineering lies not just in moving and shaping data, but in enabling people to make better decisions and build better products. Without it, analytics and AI rest on shaky foundations. With it, organizations gain speed, resilience, and trust in the data they use every day.

This newsletter is written by and for the data engineering community. Each edition brings together perspectives, tools, and lessons that reflect the collective experience of practitioners in the field. We hope it sparks ideas and helps strengthen the connections across our growing community.

Happy reading, and thanks for being part of the community.

- Shubham

📰 Data Pulse

Data Engineering: Sail 0.3.2 released with native Delta Lake support. Sail is an open-source, Rust-based drop-in replacement for Apache Spark. The 0.3.2 release marks a significant milestone with native Delta Lake integration, enabling direct read/write operations on existing lakehouse datasets across S3, Azure, GCS, and Cloudflare R2. Built entirely in Rust, Sail eliminates JVM overhead and garbage collection pauses, and the project reports roughly 4x faster execution at about 6% of the cost compared to traditional Spark deployments. The Delta Lake support, built against low-level APIs for optimal performance, makes Sail worth a look!
AI: MLflow 3.3.0 is now available! This release introduces several major features and improvements, especially for open-source AI observability and evaluation, including Agno Tracing integration. Agno is an open-source framework for building multi-agent systems with memory, knowledge, and reasoning, along with tracing capabilities.

Agno Tracing via autolog: agent tracing in MLflow 3.3.0 with the open-source framework Agno (Source)

AI: Unsloth now supports training the new gpt-oss models from OpenAI! Unsloth, the open-source LLM fine-tuning framework, now supports OpenAI's newly released gpt-oss models (20B and 120B parameters), enabling fine-tuning on just 14GB of VRAM for the 20B variant through custom MXFP4 quantization techniques. The official Colab notebook is a great way for Data Engineers to get started.
Data Engineering: Aiven rolls out Iceberg Topics for Apache Kafka: Zero ETL, Zero Copy. Iceberg Topics turn any Kafka topic into an Apache Iceberg table with zero ETL and zero data copies, making streaming data instantly queryable with SQL. By removing connectors and avoiding duplication, this open-source approach cuts cost, simplifies pipelines, and gives teams both real-time and analytical views of the same data.
AI: Creating AI agent solutions for warehouse data access and security. Meta is tackling growing data warehouse access complexity with a new agentic architecture that uses intelligent data-user and data-owner AI agents. These agents streamline access requests, enforce security, and guide users through data discovery, exploration, and permission workflows, while auditing and feedback mechanisms ensure guardrails remain in place.

🗓 DET Meetups in Seattle, Warsaw, NYC, and Bay Area

We have three meetup events coming up in September and one more in October:

Seattle meetup at Databricks on Thu, Sept 18 (RSVP)
Warsaw meetup at Netflix on Thu, Sept 25 (RSVP)
NYC meetup at Capital One on Thu, Sept 25 (RSVP)
Bay Area meetup at Altimate AI on Wed, Oct 1 (RSVP)

(🎤 Interested in speaking at our meetups or online webinars? Submit talk proposals here.)

🏅 Get Apache Airflow Certification for Free

Join the Beyond Analytics virtual conference on Sept 16 for a free workshop on Apache Airflow 3 fundamentals. This workshop will help you prepare for the official Airflow certification exam and answer any questions you may have. Plus, you will get a discount code for a free certification ($150 value).

👉🏼 Sign up for the workshop HERE.

(This message is sponsored by Astronomer.)

🔖 Featured Read

Ever Wonder What Actually Happens When You Hit “Send” on ChatGPT?

Author: Deepanshu Tyagi

We all know ChatGPT feels fast and reliable, but what really happens in those few seconds after you press send? This article reveals the hidden data engineering that powers the experience and shows why the real magic lies in the infrastructure behind the model.

Scale in the millions: ChatGPT manages millions of simultaneous conversations with low latency and high reliability.

Streaming backbone: A customized PyFlink + Kafka setup processes events in near real time, orchestrated across regions with Kubernetes.

Kafka Forwarder: Middleware that hides Kafka’s complexity and allows engineers to focus on AI instead of distributed systems.

Zero-downtime upgrades: Multi-cluster architecture and smart traffic shifting let OpenAI swap infrastructure while keeping the system live.

Event-driven AI: Each interaction becomes part of a data flywheel that continuously improves future responses.

Want to dive deeper? The article links to the talks it draws on.

👉🏼 Read the full article HERE.

📚 Articles of the Month

The Equality Delete Problem in Apache Iceberg: Equality deletes in Apache Iceberg make streaming CDC ingestion tricky, slowing queries and limiting compatibility. This article explains the problem and shows how RisingWave tackles it with smarter delete strategies.
Kafka to Iceberg - Exploring the Options: Thinking of streaming Kafka data into Apache Iceberg? This article breaks down three practical approaches: Flink SQL, Kafka Connect, and Confluent Tableflow.
How do Iceberg, Delta Lake, and Hudi ensure atomicity? Consistency in data lakes depends on atomicity. See how Iceberg, Delta Lake, and Hudi safeguard reliability by guaranteeing all-or-nothing writes.
How experienced engineers get unstuck in coding interviews: With many companies sticking to algorithmic interviews despite AI tools, this post reveals a systematic approach to getting unstuck during whiteboard coding. Data Engineers will recognize the parallels between boundary thinking for algorithm optimization and the same mental models needed for query optimization and distributed system design.
From Facts & Metrics to Media Machine Learning: Evolving the Data Engineering Function at Netflix: This article explores how Netflix is expanding its data engineering capabilities with a new Media ML Data Engineering specialization. It highlights the role of their Media Data Lake in powering machine learning workflows across video, audio, image, and text assets.
Soft Skills Matter Now More Than Ever – Harvard Business Review: In today’s AI-driven workplace, technical expertise alone isn’t enough. This article highlights research showing that collaboration, adaptability, and critical thinking are now the skills that set professionals apart and future-proof careers.

(✍️ Interested in publishing articles on DET on Medium? Read submission guidelines here.)

💡 DE Tip of the Month

Getting started with data quality checks doesn't have to mean introducing new frameworks to your data infrastructure. You can accomplish a lot with the built-in checks that open-source tools like Airflow already provide.

However, if you need more advanced capabilities, here are the most popular dedicated data quality projects to consider:

Soda Core: SQL-based data testing with YAML configuration
Great Expectations: Python-based data validation with extensive documentation
Deequ: Scala library built on Apache Spark for large-scale data validation

Start simple with your existing tools before adding complexity.
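
To make "start simple" concrete, here is a minimal sketch of framework-free checks: each check is a plain SQL query that returns a truthy value when the data looks healthy. The `orders` table, its columns, and the check names are hypothetical and exist only for illustration, and SQLite stands in for your warehouse so the example runs anywhere. This is not Airflow's API, just the underlying idea that SQL-based check operators build on.

```python
import sqlite3

# Hypothetical checks against a hypothetical `orders` table.
# Each query must return a single truthy value (1) when the check passes.
CHECKS = {
    "row_count_positive": "SELECT COUNT(*) > 0 FROM orders",
    "order_id_not_null": "SELECT COUNT(*) = 0 FROM orders WHERE order_id IS NULL",
    "order_id_unique": "SELECT COUNT(*) = COUNT(DISTINCT order_id) FROM orders",
    "amount_non_negative": "SELECT COUNT(*) = 0 FROM orders WHERE amount < 0",
}


def run_checks(conn, checks):
    """Run each check query and return a {check_name: passed} mapping."""
    results = {}
    for name, sql in checks.items():
        (value,) = conn.execute(sql).fetchone()
        results[name] = bool(value)
    return results


def build_sample_db():
    """Create an in-memory table with a duplicate id and a negative amount."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 19.99), (2, 5.00), (2, 12.50), (3, -4.00)],
    )
    return conn


if __name__ == "__main__":
    conn = build_sample_db()
    for name, passed in run_checks(conn, CHECKS).items():
        print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

In a pipeline, a function like `run_checks` could run as its own task right after a load step and fail the run when any check returns `False`. Once the list of checks grows or needs reporting and history, that is usually the point where a dedicated tool from the list above starts to pay off.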

Let us know what you like the most in the newsletter. See you next time!

Cheers,

Shubham and Volker

ℹ️ About Data Engineer Things

Data Engineer Things (DET) is a global community built by data engineers for data engineers. Subscribe to the newsletter and follow us on LinkedIn to gain access to exclusive learning resources and networking opportunities, including articles, webinars, meetups, conferences, mentorship, and much more.
