Bq date partitioning #641

Merged

merged 26 commits into development from bq-date-partitioning on Feb 12, 2018

Conversation

drewbanin
Contributor

@drewbanin drewbanin commented Jan 19, 2018

This branch adds support for date partitioning on BigQuery in dbt.

TODO:

  • fix tests
  • prevent views from overwriting date-partitioned table, as this could delete a significant amount of data!

Usage

Run for a single day, hard-coded

Simple usage, specify a single date manually:

{{
    config(
        materialized='table',
        partition_date='20180101',
    )
}}

select *
from `public`.`events_20180101`

Run for a range of days

More complex usage, specify a range of dates. The date_sharded_table macro will interpolate the 8-digit date for each of the days between January 1st and January 10th, inclusive. In this way, the resulting date partitioned table will have 10 partitions, one for each day, built from the corresponding date-sharded events_[YYYYMMDD] tables.

{{
    config(
        materialized='table',
        partition_date='20180101,20180110',
    )
}}

select *
from `public`.`{{ date_sharded_table('events_') }}`
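
For example, for the 20180103 partition the interpolated SQL would effectively be (an illustrative sketch, not output from this branch):

-- illustrative: what the select compiles to for the 20180103 partition
select *
from `public`.`events_20180103`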

Dynamically specify partition date(s)

Use a variable instead of hardcoding a single date. The variable defaults to "yesterday" if partition_date is not provided.

-- macros/datetime.sql
{% macro yesterday() -%}
    {%- set delta = modules.datetime.timedelta(days=-1) -%}
    {{ return((run_started_at + delta).strftime('%Y%m%d')) }}
{%- endmacro %}
--models/partitioned.sql
{{
    config(
        materialized='table',
        partition_date=var('partition_date', yesterday()),
    )
}}

select *
from `public`.`{{ date_sharded_table('events_') }}`

This branch is intended to be used in conjunction with #640 to supply variables to date partitioned tables on the command line.
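
For example, a hypothetical invocation overriding the default date from the command line (using the --vars syntax shown later in this thread) would look like:

# hypothetical example: build only the 20180115 partition
$ dbt run --vars 'partition_date: "20180115"'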

Additional configuration

A full list of configuration options for date partitioned tables is shown below:

{{
    config(
        materialized='table',
        partition_date='2018-01-01,2018-01-10',
        partition_date_format='%Y-%m-%d',
        verbose=True
    )
}}
  • partition_date_format : The date format (using strptime/strftime conventions) with which to parse the partition_date field. Default: %Y%m%d
  • verbose : If set to True, dbt will output one log line for each date partition created during the invocation of the date partitioned model. Default: False
$ dbt run
Found 1 models, 1 tests, 0 archives, 0 analyses, 49 macros, 0 operations

19:43:10 | Concurrency: 1 threads (target='dev')
19:43:10 |
19:43:18 | 6 of 6 START table model dbt_dbanin.partitioned.................. [RUN]
19:43:18 | -> Running for day 20180101
19:43:19 | -> Running for day 20180102
19:43:21 | -> Running for day 20180103
19:43:22 | 1 of 1 OK created table model dbt_dbanin.partitioned............. [CREATED 3 PARTITIONS in 4.60s]
19:43:22 |
19:43:22 | Finished running 1 table models in 13.23s.

Completed successfully

Done. PASS=1 ERROR=0 SKIP=0 TOTAL=1

@drewbanin drewbanin added this to the 0.9.2 milestone Jan 29, 2018
@cmcarthur
Copy link
Member

my gut reaction to the 20180101,20180110 syntax is that it's not ideal. it seems like it'd be better to pass a range(20180101,20180110) as the partition_date. that's more powerful & would let you do things like partition_date=[20180101, 20180201] (rebuild jan 1 and feb 1, specifically). but my guess is that you used 20180101,20180110 so that you can pass in a variable range on the command line, is that right?


@cmcarthur cmcarthur left a comment

i left a few comments that need to be addressed

 all_tables = []
 for schema in schemas:
     dataset = cls.get_dataset(profile, schema, model_name)
-    all_tables.extend(dataset.list_tables())
+    all_tables.extend(client.list_tables(dataset))
Member

oof, is this the API change you were referencing?

Contributor Author

yeah :/

-    relation_object.delete()
+    client.delete_table(relation_object)

     cls.release_connection(profile, model_name)
Member

i don't think it's correct to release the connection here -- what if you drop the table first, and then create it? i guess for bigquery it makes no difference, but better to exclude it if extraneous

-    res = cls.fetch_query_results(query)
+    res = list(iterator)

     cls.release_connection(profile, model_name)
Member

same comment as on drop re: releasing connection

-    dataset.create()
+    client.create_dataset(dataset)

     cls.release_connection(profile, model_name)
Member

ditto

for table in client.list_tables(dataset):
    client.delete_table(table.reference)

cls.release_connection(profile, name=None)
Member


{% for i in range(0, day_count + 1) %}
    {% set the_day = (modules.datetime.timedelta(days=i) + start_date).strftime('%Y%m%d') %}
    {% if verbose %}
Member

nice

The provided partition date '{{ date_str }}' does not match the expected format '{{ date_fmt }}'
{%- endset %}

{% set res = try_or_compiler_error(error_msg, modules.datetime.datetime.strptime, date_str.strip(), date_fmt) %}
Member

this is really clever haha

@@ -173,6 +173,7 @@ def setUp(self):

         # it's important to use a different connection handle here so
         # we don't look into an incomplete transaction
+        adapter.cleanup_connections()
Member

i think this obviates the need for release_connection everywhere else

@drewbanin
Contributor Author

@cmcarthur my first cut of this used start_date and end_date as two different variables instead of a single partition_date. That worked pretty well, but it was confusing that when end_date isn't set, the date partitioning only runs for the start_date. Moreover, it's a little more difficult to type out a yaml dictionary with two elements on the command line IMO.

I like the idea of using range conceptually, but these dates are essentially strings, not integers! Eg:

range(20180131, 20180201)

^this would run for 20180131 (good) and then 20180132 (bad), so we'd need to implement our own sort of date_range function I think.

I think you're right though -- it's unusual that partition_date accepts a comma-separated string. One other option I can think of is to make the materialization accept a list of dates to run for, where this list can be generated by a macro from a start/end date pair. We can implement this macro in the global project to make this easy/transparent for users.

So:

$ dbt run --vars 'partition_date: "20180101, 20180131"'

Then in your model:

{{
    config(
        materialized='table',
        partition_date=date_range_to_list(var('partition_date')),
    )
}}

...

and then a macro which looks like:

{% macro date_range_to_list(range_str) %}

    {%- set parts = range_str.split(",") -%}
    {%- set start_date = modules.datetime.datetime.strptime(parts[0].strip(), '%Y%m%d') -%}
    {%- set end_date = modules.datetime.datetime.strptime(parts[-1].strip(), '%Y%m%d') -%}
    {%- set dates = [] -%}

    {%- for i in range(0, (end_date - start_date).days + 1) -%}
        {%- do dates.append((start_date + modules.datetime.timedelta(days=i)).strftime('%Y%m%d')) -%}
    {%- endfor -%}

    {{ return(dates) }}

{% endmacro %}

So the CLI interface is the same, but the macro interface can work for a single date, a date range, or a smattering of random dates.

Let me know what you think about this kind of approach

@cmcarthur
Member

I like this approach a lot. We can write up docs on how to set your exact CLI syntax up for a project, but if someone really wanted to implement the other use case, they could do it themselves via:

{{ config(partition_date=var('partition_dates').split(',')) }}
dbt run --vars 'partition_dates: 20180101, 20180201'

@drewbanin
Contributor Author

Ok, i'll do that

@drewbanin
Contributor Author

This is now implemented such that the table materialization can accept a list of "partitions". Eg:

{{
    config(
        materialized='table',
        partitions=['20180101', '20180102', '20180103'],
        verbose=True
    )
}}

The type checking is pretty loosey goosey here, so you can give a list of strings, a list of ints, a single string, a single int, etc. These dates must be provided in BigQuery date format, i.e. an 8-character series of digits.
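
For illustration (an assumption based on that description, not an excerpt from this branch), all of the following forms should be accepted:

-- illustrative only: equivalent ways to specify the partitions config
partitions='20180101'
partitions=20180101
partitions=['20180101', '20180102']
partitions=[20180101, 20180102]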

To generate this list of dates, users can use the dbt.partition_range function built into the dbt global project. In practice, this looks like:

{{
    config(
        materialized='table',
        partitions=dbt.partition_range('20180101, 20180201'),
        verbose=True
    )
}}

This partition_range function will generate a list of dates in the range of the two provided dates. If only one date is provided, the resulting date range will only contain the date specified. This function also accepts an optional date format string. Finally, this macro can be combined with CLI vars to configure date ranges from the CLI, e.g.

$ dbt run --model partitioned_model --vars 'dates: "20180101, 20180201"'

coupled with

{{
    config(
        materialized='table',
        partitions=dbt.partition_range(var('dates')),
        verbose=True
    )
}}

Users can further extend these macros to simplify patterns which they use frequently.
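
For example, a project could wrap partition_range in its own macro. Here is a hypothetical sketch (the trailing_week name and the seven-day window are illustrative), reusing run_started_at and modules.datetime from the yesterday() example above:

-- macros/partitions.sql (hypothetical example)
{% macro trailing_week() %}
    {%- set start = (run_started_at + modules.datetime.timedelta(days=-7)).strftime('%Y%m%d') -%}
    {%- set end = (run_started_at + modules.datetime.timedelta(days=-1)).strftime('%Y%m%d') -%}
    {{ return(dbt.partition_range(start ~ ', ' ~ end)) }}
{% endmacro %}

-- models/partitioned.sql
{{
    config(
        materialized='table',
        partitions=trailing_week(),
        verbose=True
    )
}}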

@cmcarthur
Member

lgtm!

@drewbanin drewbanin merged commit 4eb75ec into development Feb 12, 2018
@drewbanin drewbanin deleted the bq-date-partitioning branch February 12, 2018 21:10
iknox-fa pushed a commit that referenced this pull request Feb 8, 2022
* first cut of date partitioning

* cleanup, implement partitioning in materialization

* update requirements.txt

* wip for date partitioning with range

* log data

* arg handling, logging, cleanup + view compat for new bq version

* add partitioning tests, compatibility with bq 0.29.0 release

* pep8

* fix for strange error in appveyor

* debug appveyor...

* dumb

* debugging weird bq adapter use in pg test

* do not use read_project in bq tests

* cleanup connections, initialize bq tests

* remove debug lines

* fix integration tests (actually)

* warning for view creation which clobbers tables

* add query timeout example for bq

* no need to release connections in the adapter

* partition_date interface change (wip)

* list of dates for bq dp tables

* tiny fixes for crufty dbt_project.yml files

* rm debug line

* fix tests


automatic commit by git-black, original commits:
  4eb75ec
iknox-fa pushed a commit that referenced this pull request Feb 8, 2022

automatic commit by git-black, original commits:
  4eb75ec
  a37374d