
DatabricksNotebookOperator fails when task_key is longer than 100 characters #41816

Open
rawwar opened this issue Aug 28, 2024 · 11 comments · May be fixed by #43106

Comments

@rawwar
Collaborator

rawwar commented Aug 28, 2024

Apache Airflow version

main (development)

If "Other Airflow 2 version" selected, which one?

No response

What happened?

According to the Databricks API documentation, task_key has a maximum length of 100 characters: Link.

When the DAG ID and task ID are long enough, we create a task_key with more than 100 characters. This limit is not enforced during job creation, so the job gets created with the full name. But when fetching the job run details via the get-run endpoint, the task_key is truncated, which causes a KeyError at the following line of code: Link

What you think should happen instead?

The task key should be unique. Hence, we could include a UUID instead of using dag_id+task_id.

How to reproduce

Have a dag_id and task_id whose combined length is longer than 100 characters and use the DatabricksNotebookOperator (a minimal sketch follows).
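
A minimal sketch of the reproduction, assuming the standard operator parameters (notebook_path, source, databricks_conn_id); the DAG and task names and the notebook path are made up:

```python
# Hypothetical DAG: dag_id plus task_id together exceed 100 characters, so the
# task_key the provider derives from "{dag_id}__{task_id}" is over the limit.
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import (
    DatabricksNotebookOperator,
)

with DAG(
    dag_id="marketing_data_platform_daily_ingestion_and_quality_checks_generated_by_dag_factory",  # over 80 chars
    start_date=datetime(2024, 1, 1),
    schedule=None,
):
    DatabricksNotebookOperator(
        task_id="run_customer_segmentation_quality_notebook",  # combined key > 100 chars
        notebook_path="/Workspace/Shared/example_notebook",
        source="WORKSPACE",
        databricks_conn_id="databricks_default",
    )
```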

Operating System

Debian GNU/Linux 12 (bookworm)

Versions of Apache Airflow Providers

apache-airflow-providers-databricks==6.8.0

Deployment

Astronomer

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@rawwar rawwar added the area:core, kind:bug, and needs-triage labels Aug 28, 2024
@eladkal
Contributor

eladkal commented Aug 28, 2024

If there is a 100-character limit, why doesn't Databricks raise an error when you create the notebook?

However, this limit does not affect during job creation

This feels more like a feature request to Databricks.
If the API call returns an exception, Airflow will notify users about the problem; if it doesn't, then this is not really an Airflow problem.
We should avoid trying to solve service problems like this within Airflow.
If Databricks decides to change the limit to 200 or 50, what then? We would need to release a new version of the provider to accommodate it. What about all the users who run previous versions?

@Lee-W
Member

Lee-W commented Aug 28, 2024

I agree with @eladkal. This should be a feature request to Databricks, but we could have a workaround like @rawwar suggests. WDYT?

@rawwar
Collaborator Author

rawwar commented Aug 28, 2024

@eladkal, that makes sense. But this case still needs to be handled on the Airflow side as well, since we want a way to track the job.

Suggestions:

  • For now, clearly mention the limit in the documentation and also log a warning when the task_key is beyond 100 characters.
  • Prefix the task_key with a unique identifier so that we can do a prefix match when searching for the task (see the sketch after this list).
  • Instead of dag_id+task_id as the task key, use a [5-10] digit UUID, which should be sufficiently unique.
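
A rough sketch of the prefix-match idea from the second suggestion; the helper name and structure are mine, not the provider's, and it assumes the Jobs API get-run response lists tasks with their (possibly truncated) task_key:

```python
# Hypothetical helper, not provider code: locate our task in a get-run
# response even if Databricks truncated the stored task_key.
from typing import Optional


def find_task_by_prefix(run_response: dict, full_task_key: str) -> Optional[dict]:
    for task in run_response.get("tasks", []):
        returned_key = task.get("task_key", "")
        # The returned key may be a truncated prefix of the key Airflow built,
        # so match on prefix instead of strict equality.
        if returned_key and full_task_key.startswith(returned_key):
            return task
    return None
```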

@Lee-W Lee-W removed the needs-triage label Oct 4, 2024
@Lee-W
Member

Lee-W commented Oct 4, 2024

@rawwar I think we can at least do "clearly mention in the documentation and also print logs when the task_key is beyond 100 characters" for now. I'm thinking of making "instead of dag_id+task_id as the task key, use a [5-10] digit UUID" an optional feature? Some users might expect it to be dag_id+task_id.

pankajkoti pushed a commit that referenced this issue Oct 8, 2024
related to #41816

Adds a warning log to indicate failure if length of task_key>100.
majorosdonat pushed a commit to boschglobal/airflow that referenced this issue Oct 10, 2024
related to apache#41816

Adds a warning log to indicate failure if length of task_key>100.
@maciej-szuszkiewicz

Hey, I've run into the same issue today. In our case, we're using an in-house DAG factory to generate DAGs from configuration files. This can result in both long DAG IDs and long task IDs, since the task IDs also contain task group names.
For example, I have a DAG ID that's already 81 characters long, and in addition to that, the DatabricksWorkflowTaskGroup is nested in another group.
So for me the task key generated by DatabricksTaskBaseOperator._get_databricks_task_id is 125 characters long, and I have no way of shortening it.

When I try to run this DAG, the DatabricksWorkflowTaskGroup.launch operator fails with:

requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://<redacted>.cloud.databricks.com/api/2.1/jobs/create
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 465, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/usr/local/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 432, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
  File "/usr/local/lib/python3.9/site-packages/airflow/models/baseoperator.py", line 401, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/airflow/providers/databricks/operators/databricks_workflow.py", line 201, in execute
    job_id = self._create_or_reset_job(context)
  File "/usr/local/lib/python3.9/site-packages/airflow/providers/databricks/operators/databricks_workflow.py", line 178, in _create_or_reset_job
    job_id = self._hook.create_job(job_spec)
  File "/usr/local/lib/python3.9/site-packages/airflow/providers/databricks/hooks/databricks.py", line 226, in create_job
    response = self._do_api_call(CREATE_ENDPOINT, json)
  File "/usr/local/lib/python3.9/site-packages/airflow/providers/databricks/hooks/databricks_base.py", line 579, in _do_api_call
    raise AirflowException(msg)
airflow.exceptions.AirflowException: Response: {"error_code":"INVALID_PARAMETER_VALUE","message":"The provided task key (of 125 characters) exceeds the maximum allowed length of 100 characters."}, Status Code: 400

I see three options here:

  • Let users configure the task_key on their own in DatabricksTaskBaseOperator.
    I was able to set task_key in the DatabricksTaskOperator.task_config attribute. DatabricksWorkflowTaskGroup.launch executed successfully and created a job in Databricks. However, the execution of that DatabricksTaskOperator failed, as it was looking for a Databricks task with the key generated by _get_databricks_task_id, which didn't exist in that job.

  • Remove dag_id from DatabricksTaskBaseOperator._get_databricks_task_id. The dag_id adds nothing to the uniqueness of the results returned from _get_databricks_task_id, as it's the same value for all tasks; only the task ID matters. But this is an incomplete fix, as it won't cover all cases, for example deeply nested groups that make the Airflow task ID itself longer than 100 characters.

  • Trim the _get_databricks_task_id return value to the last 100 characters, like return f"{self.dag_id}__{task_id.replace('.', '__')}"[-100:]. It may not produce pretty values for longer IDs, but it should do the trick (a small sketch follows).
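
For illustration only (not provider code), the trimming idea from the last bullet as a standalone helper; trimming from the left keeps the task-specific suffix, which is what actually differs between tasks in the same workflow:

```python
# Hypothetical helper sketching the trim option: keep only the last 100
# characters of the generated key so it always fits the Databricks limit.
def trimmed_databricks_task_id(dag_id: str, task_id: str, limit: int = 100) -> str:
    key = f"{dag_id}__{task_id.replace('.', '__')}"
    # Slice from the right so the tail (the part that distinguishes tasks
    # within one job) survives even when the dag_id is very long.
    return key[-limit:]
```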

kunaljubce pushed a commit to kunaljubce/airflow that referenced this issue Oct 13, 2024
related to apache#41816

Adds a warning log to indicate failure if length of task_key>100.
@potiuk
Member

potiuk commented Oct 13, 2024

The warning when it happens is already merged. Should we keep this issue open? Or should we close it after this is addressed better - everyone?

@rawwar
Collaborator Author

rawwar commented Oct 13, 2024

If we agree on an approach, I can work on this one. So far, I like the idea of the task key being passed by the user. If the user doesn't provide one, we generate a random ID (possibly using a UUID).

@potiuk
Member

potiuk commented Oct 14, 2024

Assigned to you - I like the idea too.

@pankajkoti
Member

Another idea I have, @rawwar, is to check if we can compute a hash of the task key built from the current combination of DAG ID and task ID, instead of a completely random UUID. If we generate a random UUID, we would also need to store it against each task so that it can be monitored accordingly.

@rawwar
Collaborator Author

rawwar commented Oct 14, 2024

Another idea I have, @rawwar, is to check if we can compute a hash of the task key built from the current combination of DAG ID and task ID, instead of a completely random UUID. If we generate a random UUID, we would also need to store it against each task so that it can be monitored accordingly.

Yeah, I think that's better. I will just hash it and base64-encode it.
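
A hedged sketch of that direction (the helper name and digest choice are assumptions, not the eventual PR): hash the combined key only when it would exceed the limit, so short keys stay readable and the mapping stays deterministic without storing anything extra.

```python
# Hypothetical helper, not the provider's implementation.
import base64
import hashlib

MAX_TASK_KEY_LENGTH = 100  # Databricks Jobs API limit for task_key


def build_task_key(dag_id: str, task_id: str) -> str:
    key = f"{dag_id}__{task_id.replace('.', '__')}"
    if len(key) <= MAX_TASK_KEY_LENGTH:
        return key
    # Deterministic digest: the same dag_id/task_id pair always maps to the
    # same key, so nothing extra has to be stored to find the task again.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")  # 43 chars
```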

pavansharma36 pushed a commit to pavansharma36/airflow that referenced this issue Oct 14, 2024
related to apache#41816

Adds a warning log to indicate failure if length of task_key>100.
@potiuk
Member

potiuk commented Oct 14, 2024

Agreed

joaopamaral pushed a commit to joaopamaral/airflow that referenced this issue Oct 21, 2024
related to apache#41816

Adds a warning log to indicate failure if length of task_key>100.