📜 Context

Every child, from all backgrounds, can be a lifelong learner. 讓每一位孩子,不論出身,都有機會成為終身學習者。

均一教育平台 (Junyi Academy), a distinctive MOOC platform, is committed to offering all children a free, high-quality, and personalized learning environment. Unlike other MOOC platforms, Junyi refuses to accept government funding or to cultivate profitable relationships with quid pro quo implications. In addition, Junyi collaborates closely with TeachForTaiwan, with which it shares a commitment to overcoming educational challenges in rural Taiwan. Having recently immersed myself in social enterprises through my involvement with School28, I have developed a keen interest in understanding the technical challenges that organizations like Junyi face.

This article is heavily inspired by Data Mesh Principles and Logical Architecture by Zhamak Dehghani. It also draws on my 8 years of experience working in the data industry. During this time, I have not only witnessed the pains of an organization's data division struggling to keep up with its fast-moving software counterparts, but I have also carried the significant burden of playing Mario: building and maintaining ETL plumbing jobs. (Fun fact: I started this journey with Spotify’s ‣, released in 2012.)

This is a breakup letter, as personal as it can be, following a long and toxic relationship with poorly managed infrastructure in the constantly evolving data landscape. I will mostly discuss how an organization's layout, seemingly unrelated to its data stack, is in fact closely interconnected with it, and how that connection can serve as a metric for evaluating Junyi's current organizational layout and data stack. Additionally, I will dissect the current infrastructure layout and identify the pain points. Finally, I will provide both organizational and technical recommendations to turn 均一 into a well-oiled data machine.


🪜 The Evolution of the Modern Data Stack

"Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure." (Melvin Conway, commonly known as Conway's law)

I have previously covered the extensive history of the modern data stack, including the introduction of AWS Redshift (the first Cambrian explosion). For a fuller account, you can read this article by Tristan Handy, the CEO and founder of dbt Labs.

The Modern Data Stack: Past, Present, and Future

Phase 1 (2012-2016)

You have an enterprise data warehouse (EDW) alongside your operational data store (ODS), and you slap a reporting layer on top. The warehouse is usually analytics-specific, holding properly cleaned, organized, de-normalized data. It’s usually schema on write: the table structure is fixed before any data is loaded.

e.g. Postgres (ODS) to Redshift (EDW) with custom cron jobs and Periscope on top.

https://documents.lucid.app/documents/649c9feb-ccff-4b58-b318-a9a65361bb62/pages/0_0?a=513&x=-3357&y=-2231&w=576&h=376&store=1&accept=image%2F*&auth=LCA 249acfe7c9304c310cc876d76aa2219265adceb64acd9536ceb6ec02a0c05f9d-ts%3D1687149311
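To make the Phase 1 pattern concrete, here is a minimal sketch of a cron-style ETL job with schema on write. It is an illustration under loud assumptions, not anyone's production job: two in-memory SQLite databases stand in for the Postgres ODS and the Redshift EDW, and every table, column, and function name (`students`, `attempts`, `student_attempts_report`, `run_nightly_etl`) is hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite databases stand in for Postgres (ODS)
# and Redshift (EDW). All table and column names are hypothetical.
ods = sqlite3.connect(":memory:")
edw = sqlite3.connect(":memory:")

# Normalized operational tables, as an application database would hold them.
ods.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE attempts (
        id INTEGER PRIMARY KEY,
        student_id INTEGER NOT NULL,
        is_correct INTEGER NOT NULL
    );
    INSERT INTO students VALUES (1, 'Ada'), (2, 'Lin');
    INSERT INTO attempts VALUES (1, 1, 1), (2, 1, 0), (3, 2, 1);
""")

# Schema on write: the reporting table's shape is fixed before any row lands.
edw.execute("""
    CREATE TABLE student_attempts_report (
        student_id   INTEGER NOT NULL,
        student_name TEXT    NOT NULL,
        attempts     INTEGER NOT NULL,
        correct      INTEGER NOT NULL
    )
""")


def run_nightly_etl(source, target):
    """Extract from the ODS, transform (join + aggregate), load into the EDW."""
    rows = source.execute("""
        SELECT s.id, s.name, COUNT(a.id), COALESCE(SUM(a.is_correct), 0)
        FROM students s
        LEFT JOIN attempts a ON a.student_id = s.id
        GROUP BY s.id, s.name
        ORDER BY s.id
    """).fetchall()
    with target:  # commits on success, rolls back on error
        target.execute("DELETE FROM student_attempts_report")  # full refresh
        target.executemany(
            "INSERT INTO student_attempts_report VALUES (?, ?, ?, ?)", rows
        )
    return len(rows)


loaded = run_nightly_etl(ods, edw)
print(f"loaded {loaded} de-normalized rows")
```

In real life a cron entry would call something like `run_nightly_etl` once a night. The transformation happens before the load, so the reporting layer only ever sees clean, de-normalized rows, and any change to the report's shape means changing the job.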

Phase 2 (2016-2018)

You start to introduce a silver bullet: the data lake, with its unstructured, unorganized data. In addition, the janky ETLs are now managed by an orchestration framework. It’s still a mess, but at least it’s a localized, programmatic mess. The concept of schema on read is also introduced: structure is applied only as data gets pulled out of the data lake. Notice the breakdown of the monolith on the operational side and the rise of microservices.

e.g. Postgres to Redshift with Fivetran for database replication and Airflow as your ETL choice.
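Schema on read is easier to see than to describe, so here is a minimal sketch in plain Python. The raw events, the field names, and the `read_with_schema` helper are all made up for illustration; the point is only that the lake accepts anything, and the reader decides what shape the data takes.

```python
import json

# Hypothetical raw events as they might sit in a data lake. Nothing was
# validated on write, so fields are missing, extra, or stored as strings.
RAW_EVENTS = [
    '{"user_id": "42", "event": "video_play", "ts": "2023-06-01T10:00:00"}',
    '{"user_id": 7, "event": "exercise_done", "score": "0.8"}',
    '{"event": "video_play", "device": "tablet"}',
]

# The schema lives with the reader, not with the storage:
# field name -> (type caster, default when the field is absent).
EVENT_SCHEMA = {
    "user_id": (int, None),
    "event": (str, "unknown"),
    "score": (float, None),
}


def read_with_schema(lines, schema):
    """Parse each raw line and project it onto the reader's schema."""
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, (cast, default) in schema.items():
            value = record.get(field)
            # Cast present values, fall back to the default otherwise;
            # fields not named in the schema are silently dropped.
            row[field] = cast(value) if value is not None else default
        yield row


events = list(read_with_schema(RAW_EVENTS, EVENT_SCHEMA))
for event in events:
    print(event)
```

Two readers with two different schemas can consume the same files, which is the flexibility a data lake promises. It is also where the mess comes from: every reader has to rediscover what the data means.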


Phase 3 (2018 - Present)

With the introduction of more data sources, ETLs are being replaced with ELTs, which load raw data first and so bypass the slow up-front transformation step. Furthermore, alternative warehousing options that separate compute and storage have improved the transformation step. You add on streaming for real-time analytics. Additionally, you start to use the power of the cloud to manage your infrastructure.

e.g. Postgres and MySQL to Snowflake running dbt, with an event bus system for events data.

https://documents.lucid.app/documents/649c9feb-ccff-4b58-b318-a9a65361bb62/pages/0_0?a=919&x=-3375&y=-2367&w=1192&h=765&store=1&accept=image%2F*&auth=LCA 7b71af0363e3a83aaba0bdebc578438a71dd381781d5efe4a0f2a390f8e29617-ts%3D1687149311
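The difference between ETL and ELT is mostly a matter of where the transformation runs. A minimal sketch, with heavy assumptions: an in-memory SQLite database stands in for the warehouse (Snowflake in the example above), a SQL view plays the role a dbt model would, and the table, view, and column names are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the cloud warehouse.
warehouse = sqlite3.connect(":memory:")

# Extract + Load: raw rows land untouched, everything stored as text.
warehouse.execute(
    "CREATE TABLE raw_attempts (student_id TEXT, exercise TEXT, is_correct TEXT)"
)
raw_rows = [
    ("1", "fractions", "true"),
    ("1", "fractions", "false"),
    ("2", "decimals", "true"),
    ("2", " Fractions ", "TRUE"),  # messy on purpose: padding and casing
]
warehouse.executemany("INSERT INTO raw_attempts VALUES (?, ?, ?)", raw_rows)

# Transform: the cleanup runs inside the warehouse as SQL, after loading.
# A dbt model is essentially a SELECT like this one, kept in version control.
warehouse.execute("""
    CREATE VIEW exercise_accuracy AS
    SELECT
        LOWER(TRIM(exercise)) AS exercise,
        COUNT(*) AS attempts,
        AVG(CASE WHEN LOWER(is_correct) = 'true' THEN 1.0 ELSE 0.0 END) AS accuracy
    FROM raw_attempts
    GROUP BY LOWER(TRIM(exercise))
""")

result = warehouse.execute(
    "SELECT exercise, attempts, accuracy FROM exercise_accuracy ORDER BY exercise"
).fetchall()
for exercise, attempts, accuracy in result:
    print(f"{exercise}: {attempts} attempts, accuracy {accuracy:.2f}")
```

Because the raw table is kept as loaded, changing the business logic means editing a SELECT and re-running it, not re-extracting from the source systems.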

均一’s Data Infrastructure


Let's examine the overall architecture. It seems that we have a data lake on GCP Cloud Storage, which ingests data from four sources, both internal and external. The pipelines are orchestrated by Airflow, while the transformation step of the ELT is handled by dbt. Finally, dashboards and reporting are served by Metabase and GCP Looker. I place 均一 at a solid Phase 2.5, which is a natural stage in the evolution of many data organizations.
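The layout above boils down to a dependency graph, which is what an orchestrator resolves before running anything. Here is a rough sketch using Python's standard-library `graphlib` (Python 3.9+). To be clear, this is not Airflow's API and not 均一's actual DAG: the task names are placeholders I made up, and I am assuming the four sources land in the lake independently before dbt runs.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: task -> set of tasks it depends on.
# The four sources are numbered because their real names are not given here.
PIPELINE = {
    "ingest_source_1": set(),
    "ingest_source_2": set(),
    "ingest_source_3": set(),
    "ingest_source_4": set(),
    "dbt_transform": {
        "ingest_source_1",
        "ingest_source_2",
        "ingest_source_3",
        "ingest_source_4",
    },
    "metabase_dashboards": {"dbt_transform"},
    "looker_reports": {"dbt_transform"},
}


def execution_order(dag):
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(PIPELINE)
print(" -> ".join(order))
```

The exact order among the four ingest tasks is not significant, since they do not depend on each other. What matters is that dbt only runs once everything has landed in the lake, and that the dashboards only refresh after dbt has finished.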