Trying to register a dataset with the python sdk #1091

ashic · 2020-08-06T22:27:55Z

I'm trying to programmatically register a dataset with the Python SDK. The dataset will come from a datastore backed by a storage container, and will point to a CSV file. I'm trying the following:

dataset = Dataset.Tabular.from_delimited_files(ds.path(relative_path), validate=False)
dataset.register(ws, name)

This fails when I run it on Arch Linux, with an error stating that Arch Linux is not supported. Going through the stack trace, it looks like the ML SDK asks the dotnetcore2 Python package to install additional dependencies at runtime. I have .NET Core 2 and 3 installed on the machine. Is there a way to install whatever is needed up front, so that binaries are not downloaded at runtime? I need to run this script as part of a CI pipeline, and downloading binaries on every run is not sensible.

Also, is there a way to register a dataset without the data being accessed or used locally, via .NET or otherwise? I just want a dataset in the Azure ML workspace. I don't care about using the data locally (or in the CI pipeline).

Thanks.
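For anyone who wants to run the two lines above as a script, here is a minimal sketch. It keeps the same calls (`Dataset.Tabular.from_delimited_files` with `validate=False`, then `register`). The function name `register_csv_dataset` and the `tabular_factory` parameter are hypothetical and only exist so the logic can be exercised without a real workspace. It has not been verified against every SDK version.

```python
def register_csv_dataset(workspace, datastore, relative_path, name,
                         tabular_factory=None):
    """Register a tabular dataset that points at a CSV file in a datastore.

    `tabular_factory` is a hypothetical hook for testing; by default the
    function uses `Dataset.Tabular` from azureml-core.
    """
    if tabular_factory is None:
        # Imported lazily so the module loads even without azureml-core.
        from azureml.core import Dataset
        tabular_factory = Dataset.Tabular

    # validate=False skips the check that the data can actually be loaded.
    dataset = tabular_factory.from_delimited_files(
        datastore.path(relative_path), validate=False
    )
    return dataset.register(workspace, name)


# Example usage (assumes a workspace config file and an existing datastore;
# "my_datastore", the path and the dataset name are placeholders):
#
#   from azureml.core import Workspace, Datastore
#   ws = Workspace.from_config()
#   ds = Datastore.get(ws, "my_datastore")
#   register_csv_dataset(ws, ds, "folder/data.csv", "my-dataset")
```

Passing `validate=False` only skips the load check. Whether anything else in the call touches the data depends on the SDK version, so treat this as a starting point rather than a guarantee.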


dataders · 2020-08-07T00:44:55Z

I'm pretty sure you can do this from the Azure ML UI (Create Datasets in the Studio). Another alternative would be to use a different OS, or an Azure ML Compute Instance Notebook VM.

ashic · 2020-08-07T00:57:24Z

I've managed to make some progress. The missing dependency appears to be the Python lttngust package. This isn't a pip / conda package, but an optional dependency of .NET Core that is installed at the OS level. I'm running Arch Linux, so

sudo pacman -S python-lttngust

allowed me to get past the point where it checks for dependencies. Looking through the codebase, it appears that azureml puts dotnetcore2 in site-packages/bin of your Python environment and then checks for OS-level dependencies. If it can't find them, it copies them from Azure blobs for the "supported" OSes, and once everything is OK it writes a deps/success file. To be fair, this seems quite strange, as a library is effectively downloading binaries at runtime, but ah well...
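If you want to see whether that success marker has already been written in a given environment, something like the sketch below may help. The exact location of the marker inside the package is an assumption here, so the code simply searches the installed dotnetcore2 directory for any file named `success` inside a `deps` folder. The helper names are made up for illustration.

```python
import importlib.util
from pathlib import Path


def dotnetcore2_root():
    """Return the install directory of dotnetcore2, or None if not installed."""
    spec = importlib.util.find_spec("dotnetcore2")
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(next(iter(spec.submodule_search_locations)))


def find_success_markers(package_root):
    """List files named 'success' that sit directly inside a 'deps' directory.

    Assumption: the marker described above is a plain file at .../deps/success
    somewhere below the package directory.
    """
    root = Path(package_root)
    return sorted(
        p for p in root.rglob("success")
        if p.is_file() and p.parent.name == "deps"
    )


# Example usage:
#   root = dotnetcore2_root()
#   if root is not None:
#       print(find_success_markers(root))
```

An empty list would suggest the dependency check has not completed successfully yet in that environment, assuming the marker layout guessed above is right.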

@swanderz the UI is not an option, as I'm automating this. Another OS may be needed in the build pipeline unless I can get lttngust on our existing images. Either way the automation will be running outside the Azure ML environment.

I'm still not sure whether this brings down data or not, but at the very least, a dataset registration is working.

For those facing similar issues (i.e. "NotImplementedError: Unsupported Linux distribution"), running the following in a python terminal with the same python environment will help you see what the missing dependencies are. If you then install them at the OS level, it should work:

from dotnetcore2 import runtime
runtime._enable_debug_logging()
runtime.ensure_dependencies()

The third call prints a debug line listing the missing dependencies.
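For a CI pipeline, the same three calls can be wrapped in a small preflight check that fails early with a readable message. This is only a sketch: the function name `check_dotnet_dependencies` is hypothetical, and it assumes `ensure_dependencies()` raises (for example the `NotImplementedError` mentioned above) when the check does not pass.

```python
def check_dotnet_dependencies():
    """Run the dotnetcore2 dependency check and return (ok, message)."""
    try:
        from dotnetcore2 import runtime
    except ImportError:
        return False, "dotnetcore2 is not installed in this environment"

    try:
        # Same calls as in the snippet above; debug logging shows what is missing.
        runtime._enable_debug_logging()
        runtime.ensure_dependencies()
    except NotImplementedError as exc:
        return False, f"unsupported distribution: {exc}"
    except Exception as exc:  # anything else is reported rather than raised
        return False, f"dependency check failed: {exc!r}"

    return True, "dotnetcore2 dependency check passed"


# Example usage in a CI step:
#   ok, message = check_dotnet_dependencies()
#   print(message)
#   raise SystemExit(0 if ok else 1)
```

Running this as the first step of the pipeline means a missing OS package shows up before any dataset registration is attempted.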

(A similar issue: #713 )

dataders · 2020-08-07T01:46:30Z

@ashic you rock for:

1. being a problem-solver, and
2. sharing your findings back with the community

gold star for you!

cc: @MayMSFT

v-strudm-msft added MLOps product-question labels Aug 6, 2020

lostmygithubaccount closed this as completed Aug 25, 2020

lostmygithubaccount mentioned this issue Feb 8, 2021

[meta issue] azureml-dataprep has odd requirements #1328

Closed
