dbt should refresh connections before they are re-used #782

Closed

drewbanin opened this issue May 23, 2018 · 9 comments
Labels: bug, redshift
@drewbanin (Contributor)

...or, it should not use connections more than once?

In practice, these connections can somehow become closed, resulting in:

2018-05-23 11:07:22,549: On master: select tablename as name, schemaname as schema, 'table' as type from pg_tables
        where schemaname ilike 'analytics'
        union all
        select viewname as name, schemaname as schema, 'view' as type from pg_views
        where schemaname ilike 'analytics'
2018-05-23 11:23:39,616: Postgres error: SSL SYSCALL error: Connection timed out

I think avoiding the use of the master connection can help here. Or, is there some way to re-open the connection if we find that it is closed?
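
A minimal sketch of the re-open-if-closed idea, assuming a plain psycopg2 connection rather than dbt's adapter layer (the get_fresh_connection helper and the dsn dict are hypothetical names, not dbt internals):

import psycopg2

def get_fresh_connection(conn, dsn):
    """Return conn if it still looks open, otherwise build a replacement.

    psycopg2 sets connection.closed to a nonzero value once the connection
    has been closed locally or invalidated by a fatal error.
    """
    if conn is None or conn.closed != 0:
        conn = psycopg2.connect(**dsn)  # re-establish with the same parameters
    return conn

The catch is that connection.closed only flips after psycopg2 has noticed the failure; a connection silently dropped by a NAT device or firewall can still report 0 until the next query blows up, which is roughly what the log above shows.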

drewbanin added the bug and redshift labels on May 23, 2018
drewbanin added this to the Betsy Ross (unreleased) milestone on May 23, 2018
@ryanmcdonnell

@drewbanin I believe this is the same issue I'm experiencing that we recently discussed on Slack. It seems like anytime my dbt run takes longer than 20 minutes, the job hangs and ultimately fails with the "connection timed out" error. I can provide dbt.log if that is helpful.

@drewbanin (Contributor, Author)

Thanks @ryanmcdonnell - we're going to have to do something about this. I think some of the changes we made in 0.10.1 helped, but it looks like there are definitely still some issues. This is in our queue for our next release.

In your dbt.log file, can you see the query that results in the connection timed out error? If you could tell me what type of query it is (a model create table query or a dbt introspection query, for example), that would be super helpful.

@ryanmcdonnell

Output of dbt --version and dbt run during our CI process:

[04:00:05] :	 [Step 1/1] installed version: 0.10.1
[04:00:05] :	 [Step 1/1]    latest version: 0.10.1
[04:00:05] :	 [Step 1/1] Up to date!
[04:00:44] :	 [Step 1/1] Found 194 models, 261 tests, 1 archives, 0 analyses, 117 macros, 2 operations, 1 seed files
[04:00:44] :	 [Step 1/1] 
[04:00:44] :	 [Step 1/1] 11:00:44 | Concurrency: 1 threads (target='prod')
[04:00:44] :	 [Step 1/1] 11:00:44 | 
[04:00:44] :	 [Step 1/1] 11:00:44 | 1 of 12 START view model analytics.snowplow_base_web_page_context.... [RUN]
[04:00:45] :	 [Step 1/1] 11:00:45 | 1 of 12 OK created view model analytics.snowplow_base_web_page_context [PASS in 0.80s]
[04:00:45] :	 [Step 1/1] 11:00:45 | 2 of 12 START view model analytics.sp_base_events_fivetran........... [RUN]
[04:00:46] :	 [Step 1/1] 11:00:46 | 2 of 12 OK created view model analytics.sp_base_events_fivetran...... [PASS in 0.48s]
[04:00:46] :	 [Step 1/1] 11:00:46 | 3 of 12 START table model analytics.snowplow_web_page_context........ [RUN]
[06:12:33] :	 [Step 1/1] Encountered an error:
[06:12:33] :	 [Step 1/1] connection already closed

Here's the last bit of dbt.log:

2018-06-11 11:00:52,325: Writing runtime SQL for node "model.snowplow.snowplow_web_page_context"
2018-06-11 11:00:52,326: Using redshift connection "snowplow_web_page_context".
2018-06-11 11:00:52,326: On snowplow_web_page_context: create temporary table
    "snowplow_web_page_context__dbt_tmp"
    diststyle key distkey (page_view_id)
    
      compound sortkey(page_view_id)
  as (
    -- This one is a little tougher to make incremental
-- because there's no timestamp field here. We could
-- relocate the event collector_tstamp (by root_id)
-- onto snowplow_base_web_page_context, but that would
-- likely mitigate any performance gains!



with web_page_context as (

    select * from "analytics"."snowplow_base_web_page_context"

),

prep as (

    select
        root_id,
        id as page_view_id

    from web_page_context
    group by 1,2

),

duplicated as (

    select
        root_id

    from prep
    group by 1
    having count(*) > 1

)

select * from prep where root_id not in (select root_id from duplicated)
  );

        
        

        insert into "analytics"."snowplow_web_page_context" ("root_id", "page_view_id") (
          select "root_id", "page_view_id"
          from "snowplow_web_page_context__dbt_tmp"
        );
2018-06-11 13:12:33,760: Postgres error: SSL SYSCALL error: Connection timed out

2018-06-11 13:12:33,761: On snowplow_web_page_context: ROLLBACK
2018-06-11 13:12:33,761: Failed to release connection!
2018-06-11 13:12:33,761: On snowplow_web_page_context: ROLLBACK
2018-06-11 13:12:33,762: Connection 'master' was left open.
2018-06-11 13:12:33,762: Connection 'snowplow_web_page_context' was left open.
2018-06-11 13:12:33,763: Encountered an error:
2018-06-11 13:12:33,763: connection already closed
2018-06-11 13:12:33,766: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/dbt/main.py", line 41, in main
    results, succeeded = handle_and_check(args)
  File "/usr/local/lib/python3.6/site-packages/dbt/main.py", line 85, in handle_and_check
    task, res = run_from_args(parsed)
  File "/usr/local/lib/python3.6/site-packages/dbt/main.py", line 139, in run_from_args
    results = run_from_task(task, proj, parsed)
  File "/usr/local/lib/python3.6/site-packages/dbt/main.py", line 147, in run_from_task
    result = task.run()
  File "/usr/local/lib/python3.6/site-packages/dbt/task/run.py", line 26, in run
    results = runner.run(query, ModelRunner)
  File "/usr/local/lib/python3.6/site-packages/dbt/runner.py", line 217, in run
    return self.run_from_graph(Selector, Runner, query)
  File "/usr/local/lib/python3.6/site-packages/dbt/runner.py", line 203, in run_from_graph
    res = self.execute_nodes(linker, Runner, flat_graph, dep_list)
  File "/usr/local/lib/python3.6/site-packages/dbt/runner.py", line 127, in execute_nodes
    for result in pool.imap_unordered(self.call_runner, args_list):
  File "/usr/local/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
  File "/usr/local/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.6/site-packages/dbt/runner.py", line 81, in call_runner
    result = runner.safe_run(flat_graph)
  File "/usr/local/lib/python3.6/site-packages/dbt/node_runners.py", line 142, in safe_run
    self.adapter.release_connection(self.profile, node_name)
  File "/usr/local/lib/python3.6/site-packages/dbt/adapters/default/impl.py", line 490, in release_connection
    cls.rollback(to_release)
  File "/usr/local/lib/python3.6/site-packages/dbt/adapters/default/impl.py", line 609, in rollback
    connection.get('handle').rollback()
psycopg2.InterfaceError: connection already closed

@drewbanin (Contributor, Author)

TODO: Check whether there's a way to add a TCP keepalive to the psycopg2 connection. Additionally, dbt should never use a master connection, or it should confirm the connection is alive before querying.
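
A sketch of the "confirm the connection is alive before querying" half of that TODO, again assuming a bare psycopg2 connection (ping_or_reconnect and dsn are illustrative names, not dbt's actual API): issue a throwaway query so a dead connection fails fast, then rebuild it.

import psycopg2

def ping_or_reconnect(conn, dsn):
    """Run a trivial query; if it fails, discard the connection and reconnect."""
    try:
        with conn.cursor() as cursor:
            cursor.execute("select 1")
            cursor.fetchone()
        return conn
    except (psycopg2.OperationalError, psycopg2.InterfaceError):
        try:
            conn.close()
        except psycopg2.Error:
            pass  # the handle is already unusable
        return psycopg2.connect(**dsn)

Unlike checking connection.closed, the ping forces a real round trip, so it also catches connections the remote end or a middlebox has dropped without telling the client.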

@ryanmcdonnell

@drewbanin Per your suggestions on Slack, I checked the Redshift docs that pertain to TCP keepalives and confirmed that the Azure machine was already configured well past the suggested limits. I also set Docker to bypass its networking by binding directly to the host network. Still no dice. Ultimately I moved the dbt runner onto an EC2 instance in the same VPC as Redshift and connected to Redshift using the private IP so that there is no public network between them. Haven't had any issues since.

@drewbanin (Contributor, Author)

Thanks for the update @ryanmcdonnell. Glad to hear you got it working, but upset that "use EC2" was the only viable solution. This is in the milestone for our next release, so we're definitely going to see if there's anything we can do to fix this from within dbt.

@cmcarthur (Member)

I have fixed similar issues in the past with OS-level settings:

echo 120 > /proc/sys/net/ipv4/tcp_keepalive_time

where a middleman (NAT) was disconnecting me.

Could be this ^^. If so, figure out if there is a way to set a TCP keepalive in psycopg2?

@cmcarthur (Member)

Spend 4 hours and figure out what we should do here.

beckjake self-assigned this on Jul 5, 2018
@beckjake (Contributor) commented Jul 5, 2018

It appears that we could insert keepalive values into the connection string, based on the libpq documentation. Those map to keyword arguments passed to psycopg2.connect as explained here. The interesting ones are:

  • keepalives (defaults to true)
  • keepalives_idle (120? defaults to system default)
  • keepalives_interval (120? defaults to system default)
  • keepalives_count (probably leave this alone unless our other changes cause problems)

Deciding what values to set is probably the hard part.
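
For reference, since extra keyword arguments to psycopg2.connect are passed straight through to libpq, the keepalive settings can be supplied at connect time. A sketch with placeholder host/credentials and illustrative values (the 120-second numbers are examples, not settled defaults):

import psycopg2

# keepalives=1 turns on TCP keepalives for the client socket; the idle,
# interval, and count values override the OS defaults so probes fire often
# enough to keep a NAT or firewall from silently dropping idle connections.
conn = psycopg2.connect(
    dbname="analytics",
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    user="dbt_user",  # placeholder
    password="...",
    keepalives=1,
    keepalives_idle=120,      # seconds idle before the first probe
    keepalives_interval=120,  # seconds between probes
    keepalives_count=5,       # unanswered probes before the connection is dropped
)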
