Trying to register a dataset with the python sdk #1091

Closed
ashic opened this issue Aug 6, 2020 · 3 comments

Comments

@ashic

ashic commented Aug 6, 2020

I'm trying to programmatically register a dataset with the Python SDK. The dataset will come from a datastore backed by a storage container, and point to a CSV file. I'm trying the following:

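from azureml.core import Dataset
# ds is the Datastore, ws the Workspace; validate=False skips the load-time validation of the file
dataset = Dataset.Tabular.from_delimited_files(ds.path(relative_path), validate=False)
dataset.register(ws, name)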

This fails when I run it on Arch Linux, with an error stating that Arch Linux is not supported. Going through the stack trace, it looks like the ML SDK asks the dotnetcore2 Python package to install additional dependencies at runtime. I have .NET Core 2 and 3 installed on the machine. Is there some way to install whatever is needed up front, so that binaries don't have to be downloaded at runtime? I'll need to run this script as part of a CI pipeline, and downloading binaries on every run isn't really sensible.

Also, is there a way to register a dataset without the data being accessed or used locally, via .NET or otherwise? I just want a dataset in the Azure ML workspace - I don't really care about using the data locally (or in the CI pipeline).

Thanks.

@dataders

dataders commented Aug 7, 2020

I'm pretty sure you can do this from the Azure ML UI (Create Datasets in the Studio). Another alternative would be to use another OS, or an Azure ML Compute Instance Notebook VM.

@ashic
Author

ashic commented Aug 7, 2020

I've managed to make some progress. It appears the missing dependency is python-lttngust. This isn't a pip / conda package but an optional dependency of .NET Core, installed at the OS level. I'm running Arch Linux, so

sudo pacman -S python-lttngust

allowed me to get past the point where it checks for dependencies. Looking through the codebase, it appears that azureml puts dotnetcore2 in site-packages/bin of your Python environment and then checks for OS-level dependencies. If it can't find them, it copies them from Azure blobs for the "supported" OSes, and once everything is OK it writes a deps/success file. Tbf, this seems quite strange, as a library is effectively downloading stuff at runtime, but ah well...
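To see whether that bootstrap step has already completed in a given environment, something like the following might help. It is only a sketch: it assumes the marker is a file named `success` inside a directory named `deps` somewhere under the installed `dotnetcore2` package, which is what the description above suggests but which I haven't verified against the package layout. `find_success_markers` and `dotnetcore2_root` are made-up helper names, not part of any SDK.

```python
import importlib.util
from pathlib import Path


def find_success_markers(package_root):
    """Return every file named 'success' whose parent directory is 'deps'."""
    root = Path(package_root)
    return sorted(
        p for p in root.rglob("success")
        if p.is_file() and p.parent.name == "deps"
    )


def dotnetcore2_root():
    """Locate the installed dotnetcore2 package directory, or None if absent."""
    spec = importlib.util.find_spec("dotnetcore2")
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


def report():
    """Print whether an (assumed) deps/success marker exists."""
    root = dotnetcore2_root()
    if root is None:
        print("dotnetcore2 is not installed in this environment")
        return
    markers = find_success_markers(root)
    if markers:
        for marker in markers:
            print(f"found marker: {marker}")
    else:
        print("no deps/success marker found under", root)
```

Calling `report()` in the same Python environment as the SDK prints any marker it finds. If the marker lives somewhere else, passing that directory to `find_success_markers` directly should still work.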

@swanderz the UI is not an option, as I'm automating this. Another OS may be needed in the build pipeline unless I can get lttngust onto our existing images. Either way, the automation will be running outside the Azure ML environment.

I'm still not sure whether this brings down data or not, but at the very least, a dataset registration is working.

For those facing similar issues (i.e. "NotImplementedError: Unsupported Linux distribution"), running the following in a Python shell in the same Python environment will show you which dependencies are missing. If you then install them at the OS level, it should work:

from dotnetcore2 import runtime
runtime._enable_debug_logging()
runtime.ensure_dependencies()

The third call will emit a debug line listing the missing dependencies.
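In a CI pipeline, the same check could run as a preflight step before the registration script, so the job fails early with a readable message. A rough sketch, assuming `runtime.ensure_dependencies()` is what raises the `NotImplementedError` quoted above; the `preflight` helper and its `ensure` parameter are hypothetical, introduced here so the logic can be exercised without dotnetcore2 installed.

```python
import sys


def preflight(ensure):
    """Run the dependency check; return True if it passes, False otherwise."""
    try:
        ensure()
    except NotImplementedError as exc:
        # The error reported in this issue, e.g. "Unsupported Linux distribution".
        print(f"dotnetcore2 dependency check failed: {exc}", file=sys.stderr)
        return False
    return True


def main():
    """Return a process exit code: 0 if dependencies are satisfied, 1 if not."""
    # Imported lazily so preflight() stays usable without dotnetcore2.
    from dotnetcore2 import runtime

    runtime._enable_debug_logging()
    return 0 if preflight(runtime.ensure_dependencies) else 1
```

Ending the script with `sys.exit(main())` would make the pipeline step fail whenever the OS-level dependencies are missing from the build image.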

(A similar issue: #713 )

@dataders

dataders commented Aug 7, 2020

@ashic you rock for:

  1. being a problem-solver, and
  2. sharing your findings back with the community

gold star for you!
cc: @MayMSFT
