Received message larger than max (4199881 vs. 4194304) #74

Closed
pselden opened this issue Oct 27, 2020 · 19 comments

@pselden commented Oct 27, 2020

I have a TFX pipeline that runs in Kubeflow on GCP. Recently one of my pipelines started failing with the following error in ResolverNode.latest_model_resolver and ResolverNode.latest_blessed_model_resolver:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/ml_metadata/metadata_store/metadata_store.py", line 165, in _call_method
    response.CopyFrom(grpc_method(request))
  File "/usr/local/lib/python3.7/dist-packages/grpc/_channel.py", line 826, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.7/dist-packages/grpc/_channel.py", line 729, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.RESOURCE_EXHAUSTED
	details = "Received message larger than max (4199881 vs. 4194304)"
	debug_error_string = "{"created":"@1603760693.874743930","description":"Received message larger than max (4199881 vs. 4194304)","file":"src/core/ext/filters/message_size/message_size_filter.cc","file_line":203,"grpc_status":8}"
>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/tfx-src/tfx/orchestration/kubeflow/container_entrypoint.py", line 360, in <module>
    main()
  File "/tfx-src/tfx/orchestration/kubeflow/container_entrypoint.py", line 353, in main
    execution_info = launcher.launch()
  File "/tfx-src/tfx/orchestration/launcher/base_component_launcher.py", line 197, in launch
    self._exec_properties)
  File "/tfx-src/tfx/orchestration/launcher/base_component_launcher.py", line 166, in _run_driver
    component_info=self._component_info)
  File "/tfx-src/tfx/components/common_nodes/resolver_node.py", line 73, in pre_execution
    source_channels=input_dict.copy())
  File "/tfx-src/tfx/dsl/experimental/latest_artifacts_resolver.py", line 56, in resolve
    output_key=c.output_key)
  File "/tfx-src/tfx/orchestration/metadata.py", line 323, in get_qualified_artifacts
    executions = self.store.get_executions_by_context(context.id)
  File "/usr/local/lib/python3.7/dist-packages/ml_metadata/metadata_store/metadata_store.py", line 1080, in get_executions_by_context
    self._call('GetExecutionsByContext', request, response)
  File "/usr/local/lib/python3.7/dist-packages/ml_metadata/metadata_store/metadata_store.py", line 140, in _call
    return self._call_method(method_name, request, response)
  File "/usr/local/lib/python3.7/dist-packages/ml_metadata/metadata_store/metadata_store.py", line 170, in _call_method
    raise _make_exception(e.details(), e.code().value[0])  # pytype: disable=attribute-error
ml_metadata.errors.ResourceExhaustedError: Received message larger than max (4199881 vs. 4194304)

Is there a way to fix this on my side?

@pselden (Author) commented Oct 27, 2020

I was able to fix this by deleting some items from the Associations table for the given context.

This is obviously just a band-aid. The root of the problem seems to be that, since there is no way to do filtering in ml-metadata, TFX has to load ALL executions just to pick the latest one.

@hughmiao (Contributor)

Hi @pselden, you can use the gRPC options to increase the size.

/cc @ruoyu90 on the resolver logic refactoring.

We are also working on filtering. Please stay tuned.
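
For illustration, a minimal sketch of setting those channel options when talking to the MLMD gRPC service directly; the service address and context id below are placeholders, not values from this thread:

```python
import grpc

from ml_metadata.proto import metadata_store_service_pb2
from ml_metadata.proto import metadata_store_service_pb2_grpc

# Hypothetical MLMD gRPC service address; adjust to your deployment.
channel = grpc.insecure_channel(
    "metadata-grpc-service.kubeflow:8080",
    options=[
        # Raise gRPC's default 4 MB receive limit so larger responses are accepted.
        ("grpc.max_receive_message_length", 32 * 1024 * 1024),
        ("grpc.max_send_message_length", 32 * 1024 * 1024),
    ],
)
stub = metadata_store_service_pb2_grpc.MetadataStoreServiceStub(channel)

request = metadata_store_service_pb2.GetExecutionsByContextRequest()
request.context_id = 42  # hypothetical context id
response = stub.GetExecutionsByContext(request)
print(len(response.executions))
```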

@hughmiao (Contributor)

@pselden, related to the discussion here, we will surface the pagination API to the python client.

@ConverJens

> Hi @pselden, you can use the gRPC options to increase the size.
>
> /cc @ruoyu90 on the resolver logic refactoring.
>
> We are also working on filtering. Please stay tuned.

@hughmiao I encountered the same issue when running the Evaluator component in TFX while slicing on continuous features. I'm also running MLMD in Kubeflow; is there a way to set the gRPC max message length as a command-line option, like grpc-port?

@hughmiao (Contributor)

@ConverJens, the config can be passed in the TFX pipeline MLMD config settings.

> I'm also running MLMD in Kubeflow; is there a way to set the gRPC max message length as a command-line option, like grpc-port?

For the Kubeflow deployment, + @dushyanthsc for KFP settings / command-line options.
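
For reference, a rough sketch of where the MLMD connection config is wired into a TFX pipeline targeting Kubeflow; the host/port values are placeholders and the exact proto fields may differ across TFX versions:

```python
from tfx.orchestration.kubeflow import kubeflow_dag_runner

# Start from TFX's default MLMD gRPC settings for Kubeflow and override them.
metadata_config = kubeflow_dag_runner.get_default_kubeflow_metadata_config()
metadata_config.grpc_config.grpc_service_host.value = "metadata-grpc-service"  # placeholder host
metadata_config.grpc_config.grpc_service_port.value = "8080"  # placeholder port

runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
    kubeflow_metadata_config=metadata_config)
# `pipeline` is your own TFX pipeline object (not defined here).
kubeflow_dag_runner.KubeflowDagRunner(config=runner_config).run(pipeline)
```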

@redramen

> ... we will surface the pagination API to the python client.

@hughmiao Do you happen to have an ETA for this? We are building on top of MLMD at Twitter, and pagination is a must-have for us.

@dushyanthsc

@redramen I should have the pagination support surfaced in the Python client by the end of this week.

@hughmiao (Contributor)

@redramen Sounds good, we will prioritize this.

Thanks, @dushyanthsc! Let's follow up in the CL.

/cc @ruoyu90 for TFX-side changes if needed.

@ConverJens

> @ConverJens, the config can be passed in the TFX pipeline MLMD config settings.
>
> > I'm also running MLMD in Kubeflow; is there a way to set the gRPC max message length as a command-line option, like grpc-port?
>
> For the Kubeflow deployment, + @dushyanthsc for KFP settings / command-line options.

@dushyanthsc Any documentation on how to specify this?

@dushyanthsc

@ConverJens We plan to use ListOperationOptions (https://github.com/google/ml-metadata/blob/master/ml_metadata/proto/metadata_store.proto#L638) in the gRPC service layer and have the Python client get executions by calling the gRPC API with a page size.

I am working on the CL and should have a point release out early next week.
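
Roughly, driving the paginated RPC directly from Python will look like the sketch below; the service address and context id are placeholders, and it assumes a server/proto version where GetExecutionsByContextRequest carries ListOperationOptions:

```python
import grpc

from ml_metadata.proto import metadata_store_service_pb2
from ml_metadata.proto import metadata_store_service_pb2_grpc

# Hypothetical MLMD gRPC service address; adjust to your deployment.
channel = grpc.insecure_channel("metadata-grpc-service.kubeflow:8080")
stub = metadata_store_service_pb2_grpc.MetadataStoreServiceStub(channel)

executions = []
next_page_token = ""
while True:
    request = metadata_store_service_pb2.GetExecutionsByContextRequest()
    request.context_id = 42  # hypothetical context id
    request.options.max_result_size = 100  # page size, keeps each response well under 4 MB
    if next_page_token:
        request.options.next_page_token = next_page_token
    response = stub.GetExecutionsByContext(request)
    executions.extend(response.executions)
    next_page_token = response.next_page_token
    if not next_page_token:
        break
```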

tfx-copybara pushed a commit that referenced this issue Nov 25, 2020
…gination and ordering results by ID, create time and last update time fields.

The change further uses the exposed options in the get_executions_by_context / get_artifacts_by_context Python APIs to address the feature request in #74.

PiperOrigin-RevId: 344194942
@dushyanthsc

@ConverJens @redramen A solution for the problem is checked in and available at HEAD.

The solution uses the pagination support to retrieve executions in pages of 100 per page and abstracts this logic behind get_executions_by_context.

If you can elaborate on how you consume MLMD, i.e. from source code, a released version, or through TFX, we can decide whether we need to cut a point release.
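
From the Python client the call itself does not change after upgrading; a minimal usage sketch (host, port and context id are placeholders):

```python
from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Hypothetical gRPC client config; point it at your MLMD service.
client_config = metadata_store_pb2.MetadataStoreClientConfig()
client_config.host = "metadata-grpc-service.kubeflow"
client_config.port = 8080

store = metadata_store.MetadataStore(client_config)
# With the change at HEAD, this pages through executions internally instead of
# fetching them all in one oversized response.
executions = store.get_executions_by_context(42)  # hypothetical context id
```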

@ConverJens commented Nov 26, 2020

@dushyanthsc That sounds great for the paging problem. However, my problem was that the proto message from a single Evaluator run was too large, so I would like to specify this option for MLMD in the Kubeflow installation, not just from a client. Any idea how this can be achieved? Perhaps as a command-line option, like grpc-port?

And regarding how we consume MLMD: through Jupyter using the kubeflow-metadata package and through TFX pipelines. The pagination is not a blocker for our use case at the moment.

@dushyanthsc

@ConverJens What is the underlying MLMD API call that the Evaluator makes? Can you provide the log of the error you are seeing? That way I can confirm whether the change solves your problem.

As for the config flag to increase the allowed payload size: you can set the command-line parameter [1], which for a kubeflow-metadata deployment gets passed from the deployment manifest [2].

[1] - https://github.com/google/ml-metadata/blob/master/ml_metadata/metadata_store/metadata_store_server_main.cc#L142

[2] - https://github.com/kubeflow/manifests/blob/master/metadata/base/metadata-deployment.yaml#L26

@dushyanthsc

Adding @Bobgy to comment on the current state of support for the kubeflow-metadata package, based on what I see in [1].

[1] - kubeflow/manifests#1638

@Bobgy commented Nov 28, 2020

@dushyanthsc Can I confirm the lowest MLMD version with pagination capabilities?

I think we'll need to upgrade the server too.

Regarding the kubeflow-metadata Python client package, there are no maintainers any more, so I'd suggest planning a migration to an alternative.

@redramen

> @ConverJens @redramen A solution for the problem is checked in and available at HEAD.
>
> The solution uses the pagination support to retrieve executions in pages of 100 per page and abstracts this logic behind get_executions_by_context.
>
> If you can elaborate on how you consume MLMD, i.e. from source code, a released version, or through TFX, we can decide whether we need to cut a point release.

We use the latest versioned release available on PyPI, so we'll be able to use this whenever a new version is released.

@dushyanthsc

@Bobgy The pagination support for GetArtifacts, GetExecutions and GetContexts has been available in the gRPC service since release 0.23.0.

@redramen Got it. I will have the point release started today and will update this thread when it is complete.

@hughmiao (Contributor) commented Dec 1, 2020

Thanks, @dushyanthsc. @redramen, in addition to using the 0.25.1 Python client release, you also need to use the 0.25.1 server binary, which adds pagination to the GetExecutionsByContext RPC used in TFX. /cc @Bobgy

@hughmiao (Contributor) commented Dec 8, 2020

Closing the issue, as the wheel and server are released. Please feel free to reopen.
