-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Snowflake web page context is unnecessarily separated and is not incremental #95
Comments
Sorry for the serious delay in responding here, @davehowell. This is something I've wanted to do for some time (#52). The potential time (+ cost) savings on Snowflake are immense. I think the required pieces are extensive enough that it could (+ should) be part of a larger refactor. |
Unfortunately I have moved on from the previous project and don't have access to Snowplow data to look at this again, but I recall that I created a few override snowplow models in the main dbt project in order to bypass some of this, as well as adding those flattened out fields into some of the special "include" fields that are available as vars. If you don't think it's too much to bite off at one time then sure include it as part of your larger refactor. I have to say the Snowplow models are pretty complicated, a lot to get your head around, and this particular issue is very Snowflake specific so might be easier to do ( and more importantly for the reviewer!) separately. If not then I guess this can be closed. |
Honestly, this is what I've done every time I use Snowflake + Snowplow together. It would obviously be better if the package could "just work" on Snowflake in the way we know to be best. |
Describe the feature
The default models assume that the
web_page_context
data will be landed in a separate table to the raw events. On Snowflake that is not the case; the data is inside a variant field. Could some models for Snowflake be skipped or made unnecessary by just flattening out the field in the source events something likeCONTEXTS_COM_SNOWPLOWANALYTICS_SNOWPLOW_WEB_PAGE_1[0]['id'] as PAGE_VIEW_ID
? Has that been considered?The snowplow_web_page_context model contains this comment:
I have found it necessary anyway to dedupe the data in my raw events table, because if I don't then the
snowplow_sessions_tmp
model fails on merge.Describe alternatives you've considered
To achieve that deduping in my raw events, this was recommended ( by helpful people on the slack channel ):
That works but is incredibly slow so I have had to implement it as incremental, using the collector_tstamp as a watermark.
My concern is with the
web_page_context
model being separated from the events, not incremental, only to be later recombined into a number of other models.If you think there is merit to this I can look at trying to put in Snowflake specific changes to optimise the model flow for that system. Otherwise if this has already been considered and ruled out could you let me know why it might be a bad idea?
Additional context
This is Snowflake specific.
Who will this benefit?
Those with huge amounts of evergrowing events from Snowplow pipelines.
The text was updated successfully, but these errors were encountered: