Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

model blocks #184

Closed
jthandy opened this issue Oct 18, 2016 · 8 comments
Closed

model blocks #184

jthandy opened this issue Oct 18, 2016 · 8 comments
Labels
enhancement New feature or request stale Issues that have gone stale

Comments

@jthandy
Copy link
Member

jthandy commented Oct 18, 2016

This issue presents a suggested alternative to the way that dbt code is currently organized.

Currently, every .sql file in the models directory is its own model. This proposal is to break that 1:1 relationship between files and models, and instead allow models to be defined similarly to how functions are defined in a programming language. For example:

def [model_1_name]

SELECT a, b
from table_c
where d is not null

def [model_2_name]

...

By itself, this change doesn't accomplish anything except for adding some light organization to a models directory. But the relaxation of the constraint of 1 file : 1 model allows users to create models very differently.

Currently, the 1:1 constraint encourages users to write models that are fairly long and complex, frequently 50-100 lines long or more. These models often contain several CTEs, each of which represents a separate data transformation. These CTEs, because they are wrapped within a larger model, cannot be tested and are not a part of the dependency graph. Plus, long code with multiple CTEs is harder to read, write, and maintain.

This file structure would allow model authors to decompose their models into multiple components, and then control each component as a part of the DAG. Each def is a model, but the default for all models is to be ephemeral, unless instructed otherwise. Ephemeral models can still be tested but are not materialized directly into the database. We still need to consider how this default interacts with model configuration scoping.

Additionally, authors would be able to provide common aliases for queries within that file using an import directive. The import directive is a convenient way for model authors to create a single alias for a table that is then used throughout the entire file. Imports would happen at the beginning of a file:

import {{ref('a_long_model_name')}} as table_c

def [model_1_name]

SELECT a, b
from table_c
where d is not null

def [model_2_name]

...

This import functionality will increase code quality and reduce lines of code, as currently model authors are writing verbose CTEs at the beginning of every model to provide this aliasing functionality.

@jthandy
Copy link
Member Author

jthandy commented Oct 18, 2016

/cc @drewbanin

@jthandy
Copy link
Member Author

jthandy commented Oct 24, 2016

here are some notes on syntax after discussing this extensively over the past few days.


--how users create things:

create_test(args)
create_model(args)
create_macro(args)


--where args is:

args = [
  sql = "",
  materialized = "",
  ...
]

--syntax options follow


--HCL-like example

model "quickbooks_bill_payments" {
  name = "something_else"
  config {
  materialized = true
  }
  sql = """
    select *
    from ${model.quickbooks_billpayment_lines.table_name}
    where whatever
  """
}


--python-like example

def quickbooks_bill_payments (config=(args)) {

  with bill_payments as (

    select * from {{ref('quickbooks_bill_payments')}}

  ), billpayment_lines as (

    select * from {{ref('quickbooks_billpayment_lines')}}

  ), d1 as (

    select
      bp.id, bp.txn_date, bpl.amount, payment_account_id, ap.id as ap_id
    from bill_payments bp
      inner join billpayment_lines bpl on bp.id = bpl.bill_payment_id
      join (select id from {{ref('quickbooks_accounts')}} where type = 'Accounts Payable') ap
        on 1 = 1

  )

  select
    id, txn_date, amount, payment_account_id as account_id, 'credit' as transaction_type,
    'bill payment' as source
    {% if var('uses_classes') == "true" %}
      , null::bigint as class_id
    {% endif %}
  from
    d1

  union all

  select
    id, txn_date, amount, ap_id, 'debit', 'bill payment'

    {% if var('uses_classes') == "true" %}
      , null::bigint
    {% endif %}
  from
    d1

}

@drewbanin
Copy link
Contributor

Sweet writeup -- thanks @jthandy

@jthandy
Copy link
Member Author

jthandy commented Dec 2, 2016

Reasons we might want this:

  • type safety when passing data between packages
  • specific functions (like import)
  • managing scope
  • managing file count
  • more idiomatic override

@jthandy jthandy added the enhancement New feature or request label Jan 11, 2017
@cmcarthur cmcarthur modified the milestone: Rethink DBT Syntax Jan 17, 2017
@drewbanin drewbanin removed this from the Rethink DBT Syntax milestone May 10, 2017
@jthandy
Copy link
Member Author

jthandy commented Oct 12, 2017

A quick update here: this is still very relevant. We've doubled down much harder on jinja since this issue was originally raised, so my guess is the most likely syntax would look something closer to this:

--jinja-like example

{% model quickbooks_bill_payments() %}

  with bill_payments as (

    select * from {{ref('quickbooks_bill_payments')}}

  ), billpayment_lines as (

    select * from {{ref('quickbooks_billpayment_lines')}}

  ), d1 as (

    select
      bp.id, bp.txn_date, bpl.amount, payment_account_id, ap.id as ap_id
    from bill_payments bp
      inner join billpayment_lines bpl on bp.id = bpl.bill_payment_id
      join (select id from {{ref('quickbooks_accounts')}} where type = 'Accounts Payable') ap
        on 1 = 1

  )

  select
    id, txn_date, amount, payment_account_id as account_id, 'credit' as transaction_type,
    'bill payment' as source
    {% if var('uses_classes') == "true" %}
      , null::bigint as class_id
    {% endif %}
  from
    d1

  union all

  select
    id, txn_date, amount, ap_id, 'debit', 'bill payment'

    {% if var('uses_classes') == "true" %}
      , null::bigint
    {% endif %}
  from
    d1

{% endmodel %}

@drewbanin drewbanin changed the title change the dbt file structure model blocks Apr 26, 2018
@clrcrl
Copy link
Contributor

clrcrl commented Jan 16, 2019

A use case that came up today that I think is a good example of where the one-to-one paradigm breaks down:
A dbt user wants to create sanitized views over their fct_ tables, that excludes PII information (assume we have a list of known PII column names).
At the moment this would be achieved one of two ways:

  • creating an extra model for each fct_ table, leading to lots of redundant code
  • creating the view via a post-hook, which means that the objects don't show up in the DAG.

Just thought it worth adding this as a use-case to think through :)

@drewbanin
Copy link
Contributor

cooked this up to explore what sub-tables (like @clrcrl is describing) could look like as custom materializations: https://gist.github.com/drewbanin/ea256260dcb8474414660d1c90d39e3f

@github-actions
Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request stale Issues that have gone stale
Projects
None yet
Development

No branches or pull requests

4 participants