Templating a data onboarding so it's repeatable, not heroic

Data

Stop artisanal data onboarding. Learn to use declarative templates and automation to create repeatable, self-documenting, and reliable data integration pipelines.

The Slack alert fires at 10 PM. A production data pipeline failed on a null value it had never seen before. The source team swore the column was non-nullable. The engineer who onboarded that source three months ago is on vacation, and nobody remembers the specific compromises made during that frantic week.

This isn't a technical failure. It's an operational one, born from a process I call "heroic onboarding." Every new data source is a bespoke project—a flurry of meetings, manual configuration, and hopeful deployments. The engineer is a hero for a day, but the system they leave behind is a brittle collection of unstated assumptions. It doesn't scale, and it breaks in the middle of the night.

The Anatomy of Artisanal Integration

Before we can fix it, we have to be honest about what this heroic approach looks like. It starts with a discovery meeting where a product manager tries to explain their service's database schema from memory. The data engineer takes notes, asks about primary keys, and tries to guess which timestamps represent the "real" event time. This tribal knowledge is stored in a document that immediately goes stale.

Next comes the build phase. The engineer copies an existing pipeline, changes a few dozen strings, and hopes they found them all. The schema is defined manually. Validation logic is a few WHERE clauses tacked on at the end. The process is entirely manual, error-prone, and different every time. The cost is invisible at first, but it compounds with every new source, creating a sprawling, inconsistent mess.

[Figure: The Templated Onboarding Flow. Define Source in Template → Automation Generates Code → Validate Against Live Source → Deploy Reliable Pipeline]

The Template as a Contract

The escape from this cycle is to treat onboarding as a solved problem. The core idea is to apply the principles of Infrastructure as Code (IaC) to data onboarding: every source is described in a declarative template, a single file that serves as an explicit, machine-readable "data contract."

This concept, which others in the industry such as Andrew Jones have explored (see also the essay The Rise of Data Contracts), formalizes the promises a source system makes and the requirements the data platform has. A simple YAML file works well. It captures the essential metadata in one place, much like a sources.yml file in a dbt project.

source_name: user_profiles
owner_team: growth_engineering
source_type: postgres
connection_secret: prod/user-db/creds
source_table: public.users

schema:
  - name: id
    type: integer
    pk: true
  - name: email
    type: string
    pii: true
  - name: created_at
    type: timestamp_tz
    event_time: true

quality_checks:
  - type: not_null
    columns: [id, email, created_at]
  - type: unique
    columns: [id]

load_config:
  destination_table: raw_users
  frequency: hourly

This file is now the single source of truth. It defines the shape of the data, ownership, and the rules of engagement. It's code, ready to be checked into version control.
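
To make the contract enforceable rather than just readable, the template can be loaded into typed objects that reject malformed files early. The following is a minimal sketch under stated assumptions, not the tooling described in this post: the class names (`Column`, `SourceContract`), the `parse_contract` helper, and the `ALLOWED_TYPES` set are hypothetical, and the dict literal stands in for what a YAML parser would return for the file above.

```python
from dataclasses import dataclass

# Hypothetical set of column types the platform understands (an assumption).
ALLOWED_TYPES = {"integer", "string", "timestamp_tz"}


@dataclass(frozen=True)
class Column:
    name: str
    type: str
    pk: bool = False
    pii: bool = False
    event_time: bool = False


@dataclass(frozen=True)
class SourceContract:
    source_name: str
    owner_team: str
    source_table: str
    columns: tuple
    quality_checks: tuple = ()


def parse_contract(raw: dict) -> SourceContract:
    """Build a SourceContract from a parsed template; raise ValueError on problems."""
    columns = tuple(Column(**col) for col in raw["schema"])
    names = [c.name for c in columns]
    errors = []
    if len(names) != len(set(names)):
        errors.append("duplicate column names")
    for col in columns:
        if col.type not in ALLOWED_TYPES:
            errors.append(f"unknown type {col.type!r} for column {col.name!r}")
    if not any(c.pk for c in columns):
        errors.append("no primary key column declared")
    # Every quality check must reference a column that the schema declares.
    for check in raw.get("quality_checks", []):
        for name in check["columns"]:
            if name not in names:
                errors.append(f"{check['type']} check references unknown column {name!r}")
    if errors:
        raise ValueError("; ".join(errors))
    return SourceContract(
        source_name=raw["source_name"],
        owner_team=raw["owner_team"],
        source_table=raw["source_table"],
        columns=columns,
        quality_checks=tuple(raw.get("quality_checks", [])),
    )


# Stand-in for the parsed user_profiles template shown above.
RAW_TEMPLATE = {
    "source_name": "user_profiles",
    "owner_team": "growth_engineering",
    "source_type": "postgres",
    "connection_secret": "prod/user-db/creds",
    "source_table": "public.users",
    "schema": [
        {"name": "id", "type": "integer", "pk": True},
        {"name": "email", "type": "string", "pii": True},
        {"name": "created_at", "type": "timestamp_tz", "event_time": True},
    ],
    "quality_checks": [
        {"type": "not_null", "columns": ["id", "email", "created_at"]},
        {"type": "unique", "columns": ["id"]},
    ],
    "load_config": {"destination_table": "raw_users", "frequency": "hourly"},
}

contract = parse_contract(RAW_TEMPLATE)
print(contract.source_name, [c.name for c in contract.columns if c.pii])
```

Collecting every problem before raising, rather than stopping at the first, means a contributor sees the full list of template errors in a single CI run.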

Automation: Generate, Validate, Deploy

The template is the blueprint; automation is the factory. Once you have that structured contract, you can build deterministic tooling to handle the entire lifecycle. This work is front-loaded, but it pays off with every source you onboard afterward.

  • Generate: A command-line tool reads the template and generates all necessary boilerplate—an Airflow DAG, dbt models, or Terraform. A command like platform-cli onboard --template user_profiles.yaml creates a consistent foundation. No more copy-pasting.
  • Validate: This is the most critical step. Before any code is merged, your automation runs tests against the source system based on the contract. Can it connect? Does the schema match? Do the quality checks pass on a sample of real data? If validation fails, the build breaks. In my experience, this step catches the vast majority of surprises before they ever reach production.
  • Deploy: The resulting pull request is small, simple, and easy to review. It contains the template and the generated code. The CI/CD pipeline runs validation again, and on merge, the new pipeline is deployed. A multi-day effort becomes a 15-minute, audited process.
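
The generate and validate steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the `platform-cli` implementation: an in-memory SQLite database stands in for the live Postgres source, the contract is assumed to be already parsed into a dict, and the helper names (`generate_check_sql`, `validate_source`) and the `users` table are hypothetical.

```python
import sqlite3

# Stand-in for the relevant parts of the parsed template (an assumption).
CONTRACT = {
    "source_table": "users",
    "schema": [
        {"name": "id", "type": "integer"},
        {"name": "email", "type": "string"},
        {"name": "created_at", "type": "timestamp_tz"},
    ],
    "quality_checks": [
        {"type": "not_null", "columns": ["id", "email", "created_at"]},
        {"type": "unique", "columns": ["id"]},
    ],
}


def generate_check_sql(contract: dict) -> dict:
    """Generate: derive one violation-counting query per declared check."""
    # Identifiers are interpolated directly because they come from a reviewed
    # template; real tooling should validate or quote them.
    table = contract["source_table"]
    queries = {}
    for check in contract["quality_checks"]:
        for col in check["columns"]:
            if check["type"] == "not_null":
                sql = f"SELECT COUNT(*) FROM {table} WHERE {col} IS NULL"
            elif check["type"] == "unique":
                sql = f"SELECT COUNT({col}) - COUNT(DISTINCT {col}) FROM {table}"
            else:
                raise ValueError(f"unsupported check type: {check['type']}")
            queries[f"{check['type']}:{col}"] = sql
    return queries


def validate_source(conn: sqlite3.Connection, contract: dict) -> list:
    """Validate: compare the contract with the source and return a list of failures."""
    failures = []
    table = contract["source_table"]
    # Schema check: every contracted column must exist in the source table.
    actual = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    for col in contract["schema"]:
        if col["name"] not in actual:
            failures.append(f"missing column: {col['name']}")
    if failures:
        return failures  # The check queries would error on missing columns.
    for name, sql in generate_check_sql(contract).items():
        violations = conn.execute(sql).fetchone()[0]
        if violations:
            failures.append(f"{name} failed on {violations} row(s)")
    return failures


# Simulated source containing the kind of null that triggers a 10 PM alert.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "a@example.com", "2024-01-01T00:00:00Z"),
        (2, None, "2024-01-02T00:00:00Z"),
    ],
)
failures = validate_source(conn, CONTRACT)
print(failures)  # ['not_null:email failed on 1 row(s)']
```

In a CI job, a non-empty failure list would fail the build, so the null email surfaces in the pull request instead of in production.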

From Gatekeeper to Platform Enabler

This architecture profoundly changes the data team's role. They are no longer gatekeepers, a bottleneck for every team needing to integrate a source. They become platform engineers whose job is to improve the onboarding system.

This doesn't mean GUI-based ELT tools have no place; they are excellent for standard SaaS sources. But for internal services and custom logic, a templated approach provides the control, testability, and extensibility that a pure-UI tool cannot. The ultimate goal is to enable other engineering teams to onboard their own data safely. They fill out a YAML file, open a pull request, and watch CI give them a green check. The data team reviews the contract, not the boilerplate.

You will always need an escape hatch for truly bizarre or legacy systems. But by automating the common path, you free up your best engineers to focus their heroic efforts where they're actually needed.

[Figure: Architecture for Repeatable Integration. Source Systems (Service Databases, Event Streams, SaaS APIs, File Drops) → Automated Onboarding & Processing Platform (Declarative Templates, Generation Engine, Deterministic Pipelines, Data Warehouse) → Serving & Consumption (BI Dashboards, Internal APIs, ML Models)]

A Living, Self-Documenting System

A powerful benefit emerges over time. Your repository of onboarding templates becomes a living, accurate, and machine-readable catalog of every data source. You can build tooling that scans these templates to automatically populate a data discovery UI, generate data lineage graphs, configure PII masking policies from the pii: true flag, and set up standardized monitoring for every new pipeline.
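
As a rough illustration of that kind of tooling, the sketch below scans a set of parsed templates and derives a catalog entry and a PII masking list for each destination table. The function name `build_catalog`, the output shape, and the second template (`orders`) are hypothetical; only the `user_profiles` fields come from the template shown earlier.

```python
# Parsed templates, as they might be loaded from the repository (an assumption).
TEMPLATES = [
    {
        "source_name": "user_profiles",
        "owner_team": "growth_engineering",
        "schema": [
            {"name": "id", "type": "integer", "pk": True},
            {"name": "email", "type": "string", "pii": True},
            {"name": "created_at", "type": "timestamp_tz", "event_time": True},
        ],
        "load_config": {"destination_table": "raw_users", "frequency": "hourly"},
    },
    {
        # Hypothetical second source, included only to show aggregation.
        "source_name": "orders",
        "owner_team": "checkout",
        "schema": [{"name": "order_id", "type": "integer", "pk": True}],
        "load_config": {"destination_table": "raw_orders", "frequency": "daily"},
    },
]


def build_catalog(templates: list) -> dict:
    """Map each destination table to its source, owner, frequency, columns, and PII columns."""
    catalog = {}
    for template in templates:
        table = template["load_config"]["destination_table"]
        catalog[table] = {
            "source": template["source_name"],
            "owner": template["owner_team"],
            "frequency": template["load_config"]["frequency"],
            "columns": [col["name"] for col in template["schema"]],
            # Columns flagged pii: true become candidates for a masking policy.
            "mask": [col["name"] for col in template["schema"] if col.get("pii", False)],
        }
    return catalog


catalog = build_catalog(TEMPLATES)
for table, entry in catalog.items():
    print(table, entry["owner"], entry["mask"])
```

Because the catalog is derived from the same files that generate the pipelines, it cannot drift from what is actually deployed.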

You've traded one-off heroics for repeatable reliability. And that's an architecture that lets everyone sleep better at night.

Key Principles to Remember

  • Define sources declaratively. A single, version-controlled file should be the source of truth for every data source contract.
  • Automate everything from the template. Humans define the "what" in the template; machines generate the "how" in the code.
  • Validate the contract against reality. Before deployment, automation must connect to the live source and verify that the promises in the template hold true.
  • Build a platform, not just pipelines. Shift the team's focus from doing the work to building the tools that enable others to do the work safely and efficiently.
Juan Cardena
Enterprise Architect, Data & AI

Enterprise architect with 25 years across web, software, data, and AI. MIT CDAO ’25. Writing on agentic AI in production.