Templating data onboarding so it's repeatable, not heroic
Stop artisanal data onboarding. Learn to use declarative templates and automation to create repeatable, self-documenting, and reliable data integration pipelines.
The Slack alert fires at 10 PM. A production data pipeline failed on a null value it had never seen before. The source team swore the column was non-nullable. The engineer who onboarded that source three months ago is on vacation, and nobody remembers the specific compromises made during that frantic week.
This isn't a technical failure. It's an operational one, born from a process I call "heroic onboarding." Every new data source is a bespoke project—a flurry of meetings, manual configuration, and hopeful deployments. The engineer is a hero for a day, but the system they leave behind is a brittle collection of unstated assumptions. It doesn't scale, and it breaks in the middle of the night.
The Anatomy of Artisanal Integration
Before we can fix it, we have to be honest about what this heroic approach looks like. It starts with a discovery meeting where a product manager tries to explain their service's database schema from memory. The data engineer takes notes, asks about primary keys, and tries to guess which timestamps represent the "real" event time. This tribal knowledge is stored in a document that immediately goes stale.
Next comes the build phase. The engineer copies an existing pipeline, changes a few dozen strings, and hopes they found them all. The schema is typed out by hand. Validation logic is a few WHERE clauses tacked on at the end. The result is error-prone and different every time. The cost is invisible at first, but it compounds with every new source until the platform is a sprawling, inconsistent mess.
The Template as a Contract
The escape from this cycle is to treat onboarding as a solved problem. The core idea is to apply the principles of Infrastructure as Code (IaC) to data itself. We do this with a declarative template—a single file that serves as an explicit, machine-readable "data contract."
This concept, which others in the industry like Andrew Jones have explored in The Rise of Data Contracts, formalizes the promises a source system makes and the requirements the data platform has. A simple YAML file works beautifully. It captures the essential metadata in one place, much like the sources.yml file in the popular dbt framework.
source_name: user_profiles
owner_team: growth_engineering
source_type: postgres
connection_secret: prod/user-db/creds
source_table: public.users
schema:
  - name: id
    type: integer
    pk: true
  - name: email
    type: string
    pii: true
  - name: created_at
    type: timestamp_tz
    event_time: true
quality_checks:
  - type: not_null
    columns: [id, email, created_at]
  - type: unique
    columns: [id]
load_config:
  destination_table: raw_users
  frequency: hourly
This file is now the single source of truth. It defines the shape of the data, ownership, and the rules of engagement. It's code, ready to be checked into version control.
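Because the contract is structured data, tooling can load it into typed objects and reject a malformed template before anything else runs. The following is a minimal sketch, not a definitive implementation: the names Column, Contract, parse_contract, and KNOWN_TYPES are hypothetical, the set of allowed types is an assumption based only on the three types shown above, and the template is written as the dict a YAML parser would produce so the example needs no third-party library.

```python
from dataclasses import dataclass

# Assumed type vocabulary: only the three types that appear in the example template.
KNOWN_TYPES = {"integer", "string", "timestamp_tz"}


@dataclass(frozen=True)
class Column:
    name: str
    type: str
    pk: bool = False
    pii: bool = False
    event_time: bool = False


@dataclass(frozen=True)
class Contract:
    source_name: str
    owner_team: str
    columns: tuple
    quality_checks: tuple


def parse_contract(raw: dict) -> Contract:
    """Build a Contract from a parsed template; raise ValueError if it is inconsistent."""
    columns = tuple(Column(**col) for col in raw["schema"])
    names = {c.name for c in columns}
    for col in columns:
        if col.type not in KNOWN_TYPES:
            raise ValueError(f"{col.name}: unknown type {col.type!r}")
    if not any(c.pk for c in columns):
        raise ValueError("contract must declare at least one pk column")
    checks = tuple(raw.get("quality_checks", []))
    for check in checks:
        missing = set(check["columns"]) - names
        if missing:
            raise ValueError(
                f"{check['type']} check references unknown columns: {sorted(missing)}"
            )
    return Contract(raw["source_name"], raw["owner_team"], columns, checks)


# The dict a YAML parser would produce for the user_profiles template.
RAW_TEMPLATE = {
    "source_name": "user_profiles",
    "owner_team": "growth_engineering",
    "source_type": "postgres",
    "connection_secret": "prod/user-db/creds",
    "source_table": "public.users",
    "schema": [
        {"name": "id", "type": "integer", "pk": True},
        {"name": "email", "type": "string", "pii": True},
        {"name": "created_at", "type": "timestamp_tz", "event_time": True},
    ],
    "quality_checks": [
        {"type": "not_null", "columns": ["id", "email", "created_at"]},
        {"type": "unique", "columns": ["id"]},
    ],
    "load_config": {"destination_table": "raw_users", "frequency": "hourly"},
}

contract = parse_contract(RAW_TEMPLATE)
print([c.name for c in contract.columns if c.pii])  # ['email']
```

Failing fast on a typo in a column name is cheap here and expensive at 10 PM, which is the whole argument for treating the template as code.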
Automation: Generate, Validate, Deploy
The template is the blueprint; automation is the factory. Once you have that structured contract, you can build deterministic tooling to handle the entire lifecycle. This work is front-loaded, but it pays off again with every source you onboard afterward.
- Generate: A command-line tool reads the template and generates all necessary boilerplate: an Airflow DAG, dbt models, or Terraform. A command like platform-cli onboard --template user_profiles.yaml creates a consistent foundation. No more copy-pasting.
- Validate: This is the most critical step. Before any code is merged, your automation runs tests against the source system based on the contract. Can it connect? Does the schema match? Do the quality checks pass on a sample of real data? If validation fails, the build breaks. In my experience, this step catches the vast majority of surprises before they ever reach production.
- Deploy: The resulting pull request is small, simple, and easy to review. It contains the template and the generated code. The CI/CD pipeline runs validation again, and on merge, the new pipeline is deployed. A multi-day effort becomes a 15-minute, audited process.
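The generate and validate steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the tooling described in this article: generate_ddl, validate_sample, and the SQL_TYPES mapping are hypothetical names, the generated artifact is a plain CREATE TABLE statement rather than an Airflow DAG or dbt model, and the sample rows are invented in-memory data standing in for a query against the live source.

```python
from collections import Counter

# Assumed mapping from template types to warehouse column types.
SQL_TYPES = {"integer": "BIGINT", "string": "TEXT", "timestamp_tz": "TIMESTAMPTZ"}

# Subset of the user_profiles template, as a YAML parser would return it.
TEMPLATE = {
    "schema": [
        {"name": "id", "type": "integer", "pk": True},
        {"name": "email", "type": "string", "pii": True},
        {"name": "created_at", "type": "timestamp_tz", "event_time": True},
    ],
    "quality_checks": [
        {"type": "not_null", "columns": ["id", "email", "created_at"]},
        {"type": "unique", "columns": ["id"]},
    ],
    "load_config": {"destination_table": "raw_users", "frequency": "hourly"},
}


def generate_ddl(template: dict) -> str:
    """Generate step: render a CREATE TABLE statement from the declared schema."""
    cols = ",\n".join(
        f"  {c['name']} {SQL_TYPES[c['type']]}" for c in template["schema"]
    )
    table = template["load_config"]["destination_table"]
    return f"CREATE TABLE {table} (\n{cols}\n);"


def validate_sample(template: dict, rows: list) -> list:
    """Validate step: return one message per quality check the sample violates."""
    failures = []
    for check in template["quality_checks"]:
        for col in check["columns"]:
            values = [row.get(col) for row in rows]
            if check["type"] == "not_null" and any(v is None for v in values):
                failures.append(f"not_null failed on {col}")
            elif check["type"] == "unique":
                dupes = [v for v, n in Counter(values).items() if n > 1]
                if dupes:
                    failures.append(f"unique failed on {col}: {dupes}")
    return failures


# Invented sample rows: a duplicate id and a null email.
sample = [
    {"id": 1, "email": "a@example.com", "created_at": "2024-01-01T00:00:00Z"},
    {"id": 1, "email": None, "created_at": "2024-01-02T00:00:00Z"},
]

print(generate_ddl(TEMPLATE))
print(validate_sample(TEMPLATE, sample))
# ['not_null failed on email', 'unique failed on id: [1]']
```

In CI, a non-empty failure list would fail the build, so the null email that the source team swore could never happen is caught in the pull request instead of in production.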
From Gatekeeper to Platform Enabler
This architecture profoundly changes the data team's role. They are no longer gatekeepers, a bottleneck for every team needing to integrate a source. They become platform engineers whose job is to improve the onboarding system.
This doesn't mean GUI-based ELT tools have no place; they are excellent for standard SaaS sources. But for internal services and custom logic, a templated approach provides the control, testability, and extensibility that a pure-UI tool cannot. The ultimate goal is to enable other engineering teams to onboard their own data safely. They fill out a YAML file, open a pull request, and watch CI give them a green check. The data team reviews the contract, not the boilerplate.
You will always need an escape hatch for truly bizarre or legacy systems. But by automating the common path, you free up your best engineers to focus their heroic efforts where they're actually needed.
A Living, Self-Documenting System
A powerful benefit emerges over time. Your repository of onboarding templates becomes a living, accurate, and machine-readable catalog of every data source. You can build tooling that scans these templates to automatically populate a data discovery UI, generate data lineage graphs, configure PII masking policies from the pii: true flag, and set up standardized monitoring for every new pipeline.
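As a rough illustration of that scanning, the sketch below folds a list of parsed templates into catalog entries and derives the columns to mask from the pii flag. The function build_catalog and the second template (order_events) are hypothetical, and the templates are inline dicts rather than files read from a repository.

```python
def build_catalog(templates: list) -> list:
    """Summarise each contract into one catalog entry, sorted by source name."""
    entries = []
    for t in templates:
        entries.append(
            {
                "source": t["source_name"],
                "owner": t["owner_team"],
                "table": t["load_config"]["destination_table"],
                # Columns flagged pii: true are the ones a masking policy would target.
                "pii_columns": [c["name"] for c in t["schema"] if c.get("pii")],
            }
        )
    return sorted(entries, key=lambda e: e["source"])


templates = [
    {
        "source_name": "user_profiles",
        "owner_team": "growth_engineering",
        "schema": [
            {"name": "id", "type": "integer", "pk": True},
            {"name": "email", "type": "string", "pii": True},
            {"name": "created_at", "type": "timestamp_tz", "event_time": True},
        ],
        "load_config": {"destination_table": "raw_users", "frequency": "hourly"},
    },
    {
        # Hypothetical second source, invented for this example.
        "source_name": "order_events",
        "owner_team": "checkout",
        "schema": [
            {"name": "order_id", "type": "integer", "pk": True},
            {"name": "placed_at", "type": "timestamp_tz", "event_time": True},
        ],
        "load_config": {"destination_table": "raw_orders", "frequency": "hourly"},
    },
]

for entry in build_catalog(templates):
    print(entry["source"], entry["owner"], entry["pii_columns"])
# order_events checkout []
# user_profiles growth_engineering ['email']
```

The same loop could feed a discovery UI or a monitoring config; the point is that every downstream artifact is derived from the contracts instead of being maintained by hand.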
You've traded one-off heroics for repeatable reliability. And that's an architecture that lets everyone sleep better at night.
Key Principles to Remember
- Define sources declaratively. A single, version-controlled file should be the source of truth for every data source contract.
- Automate everything from the template. Humans define the "what" in the template; machines generate the "how" in the code.
- Validate the contract against reality. Before deployment, automation must connect to the live source and verify that the promises in the template hold true.
- Build a platform, not just pipelines. Shift the team's focus from doing the work to building the tools that enable others to do the work safely and efficiently.