Loading of zarr dataset fails due to missing "ETag" in server response. #17

Open
observingClouds opened this issue Jun 19, 2022 · 3 comments

@observingClouds

What happened
While trying to open the zarr dataset bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu with

import xarray as xr
xr.open_dataset("ipfs://bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu", engine="zarr")

a KeyError is sometimes raised:

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
    493 
    494     overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 495     backend_ds = backend.open_dataset(
    496         filename_or_obj,
    497         drop_variables=drop_variables,

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/xarray/backends/zarr.py in open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, chunk_store, storage_options, stacklevel)
    798 
    799         filename_or_obj = _normalize_path(filename_or_obj)
--> 800         store = ZarrStore.open_group(
    801             filename_or_obj,
    802             group=group,

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/xarray/backends/zarr.py in open_group(cls, store, mode, synchronizer, group, consolidated, consolidate_on_close, chunk_store, storage_options, append_dim, write_region, safe_chunks, stacklevel)
    363                     stacklevel=stacklevel,
    364                 )
--> 365                 zarr_group = zarr.open_group(store, **open_kwargs)
    366         elif consolidated:
    367             # TODO: an option to pass the metadata_key keyword

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/zarr/hierarchy.py in open_group(store, mode, cache_attrs, synchronizer, path, chunk_store, storage_options)
   1165 
   1166     # handle polymorphic store arg
-> 1167     store = _normalize_store_arg(
   1168         store, storage_options=storage_options, mode=mode
   1169     )

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/zarr/hierarchy.py in _normalize_store_arg(store, storage_options, mode)
   1055     if store is None:
   1056         return MemoryStore()
-> 1057     return normalize_store_arg(store,
   1058                                storage_options=storage_options, mode=mode)
   1059 

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/zarr/storage.py in normalize_store_arg(store, storage_options, mode)
    112     if isinstance(store, str):
    113         if "://" in store or "::" in store:
--> 114             return FSStore(store, mode=mode, **(storage_options or {}))
    115         elif storage_options:
    116             raise ValueError("storage_options passed with non-fsspec path")

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/zarr/storage.py in __init__(self, url, normalize_keys, key_separator, mode, exceptions, dimension_separator, **storage_options)
   1138         # Pass attributes to array creation
   1139         self._dimension_separator = dimension_separator
-> 1140         if self.fs.exists(self.path) and not self.fs.isdir(self.path):
   1141             raise FSPathExistNotDir(url)
   1142 

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/fsspec/asyn.py in wrapper(*args, **kwargs)
     84     def wrapper(*args, **kwargs):
     85         self = obj or args[0]
---> 86         return sync(self.loop, func, *args, **kwargs)
     87 
     88     return wrapper

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/fsspec/asyn.py in sync(loop, func, timeout, *args, **kwargs)
     64         raise FSTimeoutError from return_result
     65     elif isinstance(return_result, BaseException):
---> 66         raise return_result
     67     else:
     68         return return_result

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/fsspec/asyn.py in _runner(event, coro, result, timeout)
     24         coro = asyncio.wait_for(coro, timeout=timeout)
     25     try:
---> 26         result[0] = await coro
     27     except Exception as ex:
     28         result[0] = ex

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/fsspec/asyn.py in _isdir(self, path)
    531     async def _isdir(self, path):
    532         try:
--> 533             return (await self._info(path))["type"] == "directory"
    534         except IOError:
    535             return False

KeyError: 'type'

Expected behaviour
The dataset is returned without any error.

Potential causes
Debugging the above call by inserting a few print statements into async_ipfs.py
    async def file_info(self, path, session):
        info = {"name": path}
        headers = {"Accept-Encoding": "identity"}  # this ensures correct file size
        res = await self.cid_head(path, session, headers=headers)

        async with res:
            self._raise_not_found_for_status(res, path)
            if res.status != 200:
                # TODO: maybe handle 301 here
                raise FileNotFoundError(path)
            if "Content-Length" in res.headers:
                info["size"] = int(res.headers["Content-Length"])
            elif "Content-Range" in res.headers:
                info["size"] = int(res.headers["Content-Range"].split("/")[1])

            if "ETag" in res.headers:
                etag = res.headers["ETag"].strip("\"")
                info["ETag"] = etag
                if etag.startswith("DirIndex"):
                    info["type"] = "directory"
                    info["CID"] = etag.split("-")[-1]
                else:
                    info["type"] = "file"
                    info["CID"] = etag

        print(f"Info: {info}", flush=True)  # debug print
        print(res.status)  # debug print
        print(res.headers)  # debug print
        return info

reveals that the "ETag" is not always returned by the server. While the header looks like

Info: {'name': 'bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu', 'ETag': 'DirIndex-2b567f6r5vvdg_CID-bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu', 'type': 'directory', 'CID': 'bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu'}
200
<CIMultiDictProxy('Server': 'openresty', 'Date': 'Sat, 18 Jun 2022 23:03:06 GMT', 'Content-Type': 'text/html', 
'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'Access-Control-Allow-Methods': 'GET', 
'Etag': '"DirIndex-2b567f6r5vvdg_CID-bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu"', 
'X-Ipfs-Gateway-Host': 'ipfs-bank6-fr2', 
'X-Ipfs-Path': '/ipfs/bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu', 
'X-Ipfs-Roots': 'bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu', 
'X-IPFS-POP': 'ipfs-bank6-fr2', 'Access-Control-Allow-Origin': '*', 
'Access-Control-Allow-Methods': 'GET, POST, OPTIONS',
 'Access-Control-Allow-Headers': 'X-Requested-With, Range, Content-Range, X-Chunked-Output, X-Stream-Output', 
'Access-Control-Expose-Headers': 'Content-Range, X-Chunked-Output, X-Stream-Output', 
'X-IPFS-LB-POP': 'gateway-bank2-fr2',
 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'X-Proxy-Cache': 'MISS')>
Info: {'name': 'bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu', 
'ETag': 'DirIndex-2b567f6r5vvdg_CID-bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu', 
'type': 'directory',
'CID': 'bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu'}

for a successful request, the "ETag" header is missing when the call fails:

Info: {'name': 'bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu'}
200
<CIMultiDictProxy('Server': 'nginx/1.14.0 (Ubuntu)', 'Date': 'Sun, 19 Jun 2022 10:10:27 GMT', 'Content-Type': 'text/html', 
'Connection': 'keep-alive', 'Access-Control-Allow-Headers': 'Content-Type', 
'Access-Control-Allow-Headers': 'Range', 'Access-Control-Allow-Headers': 'User-Agent', 
'Access-Control-Allow-Headers': 'X-Requested-With', 
'Access-Control-Allow-Methods': 'GET', 'Access-Control-Allow-Methods': 'HEAD',
 'Access-Control-Allow-Origin': '*', 'Access-Control-Expose-Headers': 'Content-Range',
 'Access-Control-Expose-Headers': 'X-Chunked-Output', 
'Access-Control-Expose-Headers': 'X-Stream-Output', 
'X-Ipfs-Path': '/ipfs/bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu')>

Without the "ETag" header, the "type" key is never set, because it is only assigned inside this branch:

if "ETag" in res.headers:
    etag = res.headers["ETag"].strip("\"")
    info["ETag"] = etag
    if etag.startswith("DirIndex"):
        info["type"] = "directory"
        info["CID"] = etag.split("-")[-1]
    else:
        info["type"] = "file"
        info["CID"] = etag

Does this mean that the success of the function call depends on which IPFS peer responds quickest?
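
The failure can probably be reproduced without any network access. The sketch below copies the header handling from file_info into a hypothetical standalone helper, info_from_headers (not part of ipfsspec), and feeds it plain dicts in place of the real case-insensitive header objects. It only illustrates why the KeyError appears; it is not a proposed fix.

```python
def info_from_headers(path, headers):
    """Hypothetical helper mirroring the header handling in file_info.

    `headers` is a plain dict here, so lookups are case-sensitive,
    unlike the real response headers.
    """
    info = {"name": path}
    if "Content-Length" in headers:
        info["size"] = int(headers["Content-Length"])
    if "ETag" in headers:
        etag = headers["ETag"].strip('"')
        info["ETag"] = etag
        if etag.startswith("DirIndex"):
            info["type"] = "directory"
            info["CID"] = etag.split("-")[-1]
        else:
            info["type"] = "file"
            info["CID"] = etag
    return info


cid = "bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu"

# Headers reduced to the parts that matter for the two responses shown above.
with_etag = info_from_headers(cid, {"ETag": f'"DirIndex-2b567f6r5vvdg_CID-{cid}"'})
without_etag = info_from_headers(cid, {"X-Ipfs-Path": f"/ipfs/{cid}"})

print(with_etag["type"])       # directory
print("type" in without_etag)  # False

try:
    # This lookup is what fsspec's _isdir effectively performs on the info dict.
    without_etag["type"] == "directory"
except KeyError as err:
    print(f"KeyError: {err}")  # KeyError: 'type'
```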

@d70-t

d70-t commented Jun 20, 2022

This is related to ipfs/kubo#8528: we need a way of telling whether a CID or IPFS path resolves to a directory or to a file (that's needed for fsspec's info() method as well as isdir(), isfile(), etc.).

According to the issue mentioned above, checking the ETag is an awkward but recommended way of doing this. Apparently it does not work in all cases. We'll probably have to exclude some gateways from our default list if they have dropped support for this, or otherwise find other ways of telling files and directories apart from what we get.

@d70-t

d70-t commented Jun 20, 2022

So apparently https://gateway.pinata.cloud doesn't return etags, but is able to deliver the dataset. That's unfortunate, but I don't see a good way of getting what we need for info() from their response. Thus we might have to drop that gateway from the default list...
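
One way to check a gateway before adding it to a list might look like the sketch below. The helper gateway_returns_etag is hypothetical (it is not part of ipfsspec), it uses the standard library's urllib rather than the async session ipfsspec works with, and it assumes the gateway serves content under /ipfs/<cid>. The opener parameter exists only so the check can be tried without network access.

```python
import urllib.request


def gateway_returns_etag(gateway, cid, opener=urllib.request.urlopen, timeout=10):
    """Return True if a HEAD request for /ipfs/<cid> carries an ETag header.

    Hypothetical helper; `opener` is injectable for offline testing.
    """
    url = f"{gateway.rstrip('/')}/ipfs/{cid}"
    request = urllib.request.Request(
        url,
        method="HEAD",
        headers={"Accept-Encoding": "identity"},  # same header file_info sends
    )
    with opener(request, timeout=timeout) as response:
        return response.headers.get("ETag") is not None


# Example call (needs network access, so it is left commented out):
# cid = "bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu"
# print(gateway_returns_etag("https://gateway.pinata.cloud", cid))
```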

@observingClouds

Thanks for looking into this! That is a pity. Maybe we should approach them and inform them about this issue with their service.

So, a quick solution would be to define the environment variable IPFSSPEC_GATEWAYS and just exclude the Pinata gateway (gateway.pinata.cloud) or any other gateway that does not provide ETags. I can work with that for now, but I agree that the gateway should be dropped from the default list so the UX is better.
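
A rough sketch of that workaround, under two assumptions that should be checked against the ipfsspec documentation: that IPFSSPEC_GATEWAYS holds a whitespace-separated list of gateway URLs, and that it is read when the filesystem is set up, so it has to be in place before the dataset is opened. The candidate list is illustrative only and is not the library's default list.

```python
import os

# Illustrative candidates; NOT ipfsspec's actual default gateway list.
candidate_gateways = [
    "http://127.0.0.1:8080",
    "https://ipfs.io",
    "https://gateway.pinata.cloud",
    "https://dweb.link",
]

# Gateways observed to answer without an ETag header.
without_etags = {"https://gateway.pinata.cloud"}

usable = [g for g in candidate_gateways if g not in without_etags]

# Assumption: the variable is a whitespace-separated list of URLs.
os.environ["IPFSSPEC_GATEWAYS"] = " ".join(usable)

# Afterwards, open the dataset as before:
# import xarray as xr
# xr.open_dataset("ipfs://bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu", engine="zarr")
```

Setting the variable in the shell before starting Python should work just as well and avoids any dependence on import order.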

SethDocherty added a commit to easierdata/ipfs-stac that referenced this issue Sep 25, 2024
Revised handling of adding env variables. Note  `gateway.pinata.cloud` has been removed. See this issue for more details fsspec/ipfsspec#17 (comment)