dbt should refresh connections before they are re-used #782

Closed

drewbanin opened this issue May 23, 2018 · 9 comments
Labels: bug, redshift
@drewbanin (Contributor)

...or, it should not use connections more than once?

In practice, these connections can somehow become closed, resulting in:

2018-05-23 11:07:22,549: On master: select tablename as name, schemaname as schema, 'table' as type from pg_tables
        where schemaname ilike 'analytics'
        union all
        select viewname as name, schemaname as schema, 'view' as type from pg_views
        where schemaname ilike 'analytics'
2018-05-23 11:23:39,616: Postgres error: SSL SYSCALL error: Connection timed out

I think avoiding the use of the master connection can help here. Or, is there some way to re-open the connection if we find that it is closed?
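
A minimal sketch of the re-open-if-closed idea, assuming a plain psycopg2 connection rather than dbt's adapter layer (the get_fresh_connection helper and the dsn dict are hypothetical names, not dbt internals):

import psycopg2

def get_fresh_connection(conn, dsn):
    """Return conn if it still looks open, otherwise build a replacement.

    psycopg2 sets connection.closed to a nonzero value once the connection
    has been closed locally or invalidated by a fatal error.
    """
    if conn is None or conn.closed != 0:
        conn = psycopg2.connect(**dsn)  # re-establish with the same parameters
    return conn

The catch is that connection.closed only flips after psycopg2 has noticed the failure; a connection silently dropped by a NAT device or firewall can still report 0 until the next query blows up, which is roughly what the log above shows.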

drewbanin added the bug and redshift labels on May 23, 2018
drewbanin added this to the Betsy Ross (unreleased) milestone on May 23, 2018
@ryanmcdonnell

@drewbanin I believe this is the same issue I'm experiencing that we recently discussed on Slack. It seems like anytime my dbt run takes longer than 20 minutes, the job hangs and ultimately fails with the "connection timed out" error. I can provide dbt.log if that is helpful.

@drewbanin (Contributor, Author)

Thanks @ryanmcdonnell - we're going to have to do something about this. I think some of the changes we made in 0.10.1 helped, but it looks like there are definitely still some issues. This is in our queue for our next release.

In your dbt.log file, can you see the query that results in the connection timed out error? If you could tell me what type of query it is (a model create table query or a dbt introspection query, for example), that would be super helpful.

@ryanmcdonnell

Output of dbt --version and dbt run during our CI process:

[04:00:05] :	 [Step 1/1] installed version: 0.10.1
[04:00:05] :	 [Step 1/1]    latest version: 0.10.1
[04:00:05] :	 [Step 1/1] Up to date!
[04:00:44] :	 [Step 1/1] Found 194 models, 261 tests, 1 archives, 0 analyses, 117 macros, 2 operations, 1 seed files
[04:00:44] :	 [Step 1/1] 
[04:00:44] :	 [Step 1/1] 11:00:44 | Concurrency: 1 threads (target='prod')
[04:00:44] :	 [Step 1/1] 11:00:44 | 
[04:00:44] :	 [Step 1/1] 11:00:44 | 1 of 12 START view model analytics.snowplow_base_web_page_context.... [RUN]
[04:00:45] :	 [Step 1/1] 11:00:45 | 1 of 12 OK created view model analytics.snowplow_base_web_page_context [PASS in 0.80s]
[04:00:45] :	 [Step 1/1] 11:00:45 | 2 of 12 START view model analytics.sp_base_events_fivetran........... [RUN]
[04:00:46] :	 [Step 1/1] 11:00:46 | 2 of 12 OK created view model analytics.sp_base_events_fivetran...... [PASS in 0.48s]
[04:00:46] :	 [Step 1/1] 11:00:46 | 3 of 12 START table model analytics.snowplow_web_page_context........ [RUN]
[06:12:33] :	 [Step 1/1] Encountered an error:
[06:12:33] :	 [Step 1/1] connection already closed

Here's the last bit of dbt.log:

2018-06-11 11:00:52,325: Writing runtime SQL for node "model.snowplow.snowplow_web_page_context"
2018-06-11 11:00:52,326: Using redshift connection "snowplow_web_page_context".
2018-06-11 11:00:52,326: On snowplow_web_page_context: create temporary table
    "snowplow_web_page_context__dbt_tmp"
    diststyle key distkey (page_view_id)
    
      compound sortkey(page_view_id)
  as (
    -- This one is a little tougher to make incremental
-- because there's no timestamp field here. We could
-- relocate the event collector_tstamp (by root_id)
-- onto snowplow_base_web_page_context, but that would
-- likely mitigate any performance gains!



with web_page_context as (

    select * from "analytics"."snowplow_base_web_page_context"

),

prep as (

    select
        root_id,
        id as page_view_id

    from web_page_context
    group by 1,2

),

duplicated as (

    select
        root_id

    from prep
    group by 1
    having count(*) > 1

)

select * from prep where root_id not in (select root_id from duplicated)
  );

        
        

        insert into "analytics"."snowplow_web_page_context" ("root_id", "page_view_id") (
          select "root_id", "page_view_id"
          from "snowplow_web_page_context__dbt_tmp"
        );
2018-06-11 13:12:33,760: Postgres error: SSL SYSCALL error: Connection timed out

2018-06-11 13:12:33,761: On snowplow_web_page_context: ROLLBACK
2018-06-11 13:12:33,761: Failed to release connection!
2018-06-11 13:12:33,761: On snowplow_web_page_context: ROLLBACK
2018-06-11 13:12:33,762: Connection 'master' was left open.
2018-06-11 13:12:33,762: Connection 'snowplow_web_page_context' was left open.
2018-06-11 13:12:33,763: Encountered an error:
2018-06-11 13:12:33,763: connection already closed
2018-06-11 13:12:33,766: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/dbt/main.py", line 41, in main
    results, succeeded = handle_and_check(args)
  File "/usr/local/lib/python3.6/site-packages/dbt/main.py", line 85, in handle_and_check
    task, res = run_from_args(parsed)
  File "/usr/local/lib/python3.6/site-packages/dbt/main.py", line 139, in run_from_args
    results = run_from_task(task, proj, parsed)
  File "/usr/local/lib/python3.6/site-packages/dbt/main.py", line 147, in run_from_task
    result = task.run()
  File "/usr/local/lib/python3.6/site-packages/dbt/task/run.py", line 26, in run
    results = runner.run(query, ModelRunner)
  File "/usr/local/lib/python3.6/site-packages/dbt/runner.py", line 217, in run
    return self.run_from_graph(Selector, Runner, query)
  File "/usr/local/lib/python3.6/site-packages/dbt/runner.py", line 203, in run_from_graph
    res = self.execute_nodes(linker, Runner, flat_graph, dep_list)
  File "/usr/local/lib/python3.6/site-packages/dbt/runner.py", line 127, in execute_nodes
    for result in pool.imap_unordered(self.call_runner, args_list):
  File "/usr/local/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
  File "/usr/local/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.6/site-packages/dbt/runner.py", line 81, in call_runner
    result = runner.safe_run(flat_graph)
  File "/usr/local/lib/python3.6/site-packages/dbt/node_runners.py", line 142, in safe_run
    self.adapter.release_connection(self.profile, node_name)
  File "/usr/local/lib/python3.6/site-packages/dbt/adapters/default/impl.py", line 490, in release_connection
    cls.rollback(to_release)
  File "/usr/local/lib/python3.6/site-packages/dbt/adapters/default/impl.py", line 609, in rollback
    connection.get('handle').rollback()
psycopg2.InterfaceError: connection already closed

@drewbanin (Contributor, Author)

TODO: Check whether there's a way to add a TCP keepalive to the psycopg2 connection. Additionally, dbt should never use a master connection, or it should confirm the connection is alive before querying.
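
A sketch of the "confirm the connection is alive before querying" half of that TODO, again assuming a bare psycopg2 connection (ping_or_reconnect and dsn are illustrative names, not dbt's actual API): issue a throwaway query so a dead connection fails fast, then rebuild it.

import psycopg2

def ping_or_reconnect(conn, dsn):
    """Run a trivial query; if it fails, discard the connection and reconnect."""
    try:
        with conn.cursor() as cursor:
            cursor.execute("select 1")
            cursor.fetchone()
        return conn
    except (psycopg2.OperationalError, psycopg2.InterfaceError):
        try:
            conn.close()
        except psycopg2.Error:
            pass  # the handle is already unusable
        return psycopg2.connect(**dsn)

Unlike checking connection.closed, the ping forces a real round trip, so it also catches connections the remote end or a middlebox has dropped without telling the client.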

@ryanmcdonnell

@drewbanin Per your suggestions on Slack, I checked the Redshift docs that pertain to TCP keepalives and confirmed that the Azure machine was already configured well past the suggested limits. I also set Docker to bypass its networking by binding directly to the host network. Still no dice. Ultimately I moved the dbt runner onto an EC2 instance in the same VPC as Redshift and connected to Redshift using the private IP so that there is no public network between them. Haven't had any issues since.

@drewbanin (Contributor, Author)

Thanks for the update @ryanmcdonnell. Glad to hear you got it working, but upset that "use EC2" was the only viable solution. This is in the milestone for our next release, so we're definitely going to see if there's anything we can do to fix this from within dbt.

@cmcarthur (Member)

I have fixed similar issues in the past with OS-level settings:

echo 120 > /proc/sys/net/ipv4/tcp_keepalive_time

where a middleman (NAT) was disconnecting me.

Could be this ^^. If so, figure out if there is a way to set a TCP keepalive in psycopg2?

@cmcarthur (Member)

Spend 4 hours and figure out what we should do here.

beckjake self-assigned this on Jul 5, 2018
@beckjake (Contributor) commented Jul 5, 2018

It appears that we could insert keepalive values into the connection string, based on the libpq documentation. Those map to keyword arguments passed to psycopg2.connect as explained here. The interesting ones are:

  • keepalives (defaults to true)
  • keepalives_idle (120? defaults to system default)
  • keepalives_interval (120? defaults to system default)
  • keepalives_count (probably leave this alone unless our other changes cause problems)

Deciding what values to set is probably the hard part.
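
For reference, since extra keyword arguments to psycopg2.connect are passed straight through to libpq, the keepalive settings can be supplied at connect time. A sketch with placeholder host/credentials and illustrative values (the 120-second numbers are examples, not settled defaults):

import psycopg2

# keepalives=1 turns on TCP keepalives for the client socket; the idle,
# interval, and count values override the OS defaults so probes fire often
# enough to keep a NAT or firewall from silently dropping idle connections.
conn = psycopg2.connect(
    dbname="analytics",
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    user="dbt_user",  # placeholder
    password="...",
    keepalives=1,
    keepalives_idle=120,      # seconds idle before the first probe
    keepalives_interval=120,  # seconds between probes
    keepalives_count=5,       # unanswered probes before the connection is dropped
)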
