
max_bytes_before_external_group_by not working with DBT #351

Open
rupesh-plateiq opened this issue Sep 3, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@rupesh-plateiq

Describe the bug

I am running a dbt model with the max_bytes_before_external_group_by setting, but I am still getting a memory error. If I run the same query directly on ClickHouse Cloud, it works perfectly fine:

SELECT invoice_id,
        reason,
        argMax(message, date) AS message,
        MAX(_synced_at) AS _synced_at
FROM stg_invoices_invoicelog
GROUP BY invoice_id, reason
SETTINGS max_bytes_before_external_group_by=1000000000;

DBT model definition

{{
    config(
        materialized='incremental',
        incremental_strategy='delete+insert',
        order_by='invoice_id',
        unique_key='invoice_id',
        query_settings={'max_bytes_before_external_group_by': 1000000000}
    )
}}

SELECT invoice_id,
        reason,
        argMax(message, date) AS message,
        MAX(_synced_at) AS _synced_at
FROM stg_invoices_invoicelog
GROUP BY invoice_id, reason

Expected behaviour

The model should run without a memory error; the max_bytes_before_external_group_by setting should be applied to the generated query.

Configuration

Environment

  • dbt version: 1.0.0.38.11
  • dbt-clickhouse version: 1.8.1
  • clickhouse-driver version (if using native)
  • clickhouse-connect version (if using HTTP): 0.7.8
  • Python version: 3.8
  • Operating system: macOS

ClickHouse server

  • ClickHouse Server version: 24.6.1.4410
  • ClickHouse Server non-default settings, if any:
  • CREATE TABLE statements for tables involved:
  • Sample data for these tables, use clickhouse-obfuscator if necessary
@rupesh-plateiq added the bug (Something isn't working) label on Sep 3, 2024
@genzgd
Contributor

genzgd commented Sep 3, 2024

Can you check the ClickHouse query log (system.query_log) for the dbt version of the query? The query log should include the settings and you can compare the two queries directly.
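
For example, a query along these lines should pull back the recorded settings (the filter on the source table name is just illustrative):

SELECT query,
       Settings['max_bytes_before_external_group_by'] AS external_group_by
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query ILIKE '%stg_invoices_invoicelog%'
ORDER BY event_time DESC
LIMIT 5;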

@genzgd
Contributor

genzgd commented Sep 3, 2024

There's also a major difference between a DBT "incremental" model and a simple ClickHouse SELECT.
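
An incremental model doesn't just run your SELECT; dbt wraps it in DDL along these lines (a simplified sketch, not the exact macro output):

create table `<schema>`.`<model_name>`
engine = MergeTree()
order by (invoice_id)
as (
    -- the model's SELECT goes here
)

so any query_settings have to end up on that outer CREATE statement, not only on the inner SELECT.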

@rupesh-plateiq
Author

rupesh-plateiq commented Sep 3, 2024

Can you check the ClickHouse query log (system.query_log) for the dbt version of the query? The query log should include the settings and you can compare the two queries directly.

This is the query that I got from the log. It is not applying the query settings. Am I doing anything incorrectly when defining the query settings?
@genzgd

/* {"app": "dbt", "dbt_version": "1.8.5", "profile_name": "ottimate", "target_name": "dev", "node_id": "model.ottimate.int_invoices_invoicelog"} */

            

    
        create table `ottimate`.`int_invoices_invoicelog`
        
  
        
  engine = MergeTree()
        order by (invoice_id)
        
        
        
        
                    -- end_of_sql
                    SETTINGS  replicated_deduplication_window=0

                    
            empty
          as (
            

SELECT invoice_id,
        reason,
        argMax(message, date) AS message,
        MAX(_synced_at) AS _synced_at
FROM stg_invoices_invoicelog
GROUP BY invoice_id, reason
          )
        

@rupesh-plateiq
Author

rupesh-plateiq commented Sep 3, 2024

There's also a major difference between a DBT "incremental" model and a simple ClickHouse SELECT.

This is effectively a full refresh; I stripped the incremental logic from the model to test whether the simple query works.

Thanks for your quick reply.

@genzgd
Contributor

genzgd commented Sep 3, 2024

Thanks for the detailed log data! That looks like a bug in "full refresh", where an "intermediate" table is created and then swapped, but that intermediate table doesn't appear to have the query_settings applied. We'll see if we can pin down the actual problem over the next few days.
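
For comparison, one would expect the generated DDL to carry the model's query_settings alongside the adapter's own setting, roughly like this (illustrative only, not actual adapter output):

create table `ottimate`.`int_invoices_invoicelog`
engine = MergeTree()
order by (invoice_id)
-- end_of_sql
SETTINGS replicated_deduplication_window=0, max_bytes_before_external_group_by=1000000000
empty
as (
    ...
)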
