
Generic S3 error: Converting table to pandas and pyarrow table fails. #1256

Open · shazamkash opened this issue Mar 31, 2023 · 5 comments
Labels: binding/python (Issues for the Python package), bug (Something isn't working), good first issue (Good for newcomers)


shazamkash commented Mar 31, 2023

Environment

Delta-rs version: 0.8.1

Binding: Python

Environment:
Docker container:
Python: 3.10.7
OS: Debian GNU/Linux 11 (bullseye)
S3: Non-AWS (Ceph based)


Bug

What happened:
Opening the delta table works fine and the table clearly exists, but converting it to pandas, or from a pyarrow dataset to a table, fails with the same error shown below.

I have tried reading the same table with PySpark and it works fine. The parquet data is about 1 GB compressed (3 GB uncompressed). Furthermore, the table was written to Delta Lake using the same delta-rs version.

Error:

---------------------------------------------------------------------------
PyDeltaTableError                         Traceback (most recent call last)
Cell In[6], line 1
----> 1 dt.to_pandas()

File /opt/conda/lib/python3.10/site-packages/deltalake/table.py:418, in DeltaTable.to_pandas(self, partitions, columns, filesystem)
    404 def to_pandas(
    405     self,
    406     partitions: Optional[List[Tuple[str, str, Any]]] = None,
    407     columns: Optional[List[str]] = None,
    408     filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
    409 ) -> "pandas.DataFrame":
    410     """
    411     Build a pandas dataframe using data from the DeltaTable.
    412 
   (...)
    416     :return: a pandas dataframe
    417     """
--> 418     return self.to_pyarrow_table(
    419         partitions=partitions, columns=columns, filesystem=filesystem
    420     ).to_pandas()

File /opt/conda/lib/python3.10/site-packages/deltalake/table.py:400, in DeltaTable.to_pyarrow_table(self, partitions, columns, filesystem)
    386 def to_pyarrow_table(
    387     self,
    388     partitions: Optional[List[Tuple[str, str, Any]]] = None,
    389     columns: Optional[List[str]] = None,
    390     filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
    391 ) -> pyarrow.Table:
    392     """
    393     Build a PyArrow Table using data from the DeltaTable.
    394 
   (...)
    398     :return: the PyArrow table
    399     """
--> 400     return self.to_pyarrow_dataset(
    401         partitions=partitions, filesystem=filesystem
    402     ).to_table(columns=columns)

File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:369, in pyarrow._dataset.Dataset.to_table()

File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:2818, in pyarrow._dataset.Scanner.to_table()

File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

PyDeltaTableError: Generic S3 error: Error performing get request xxx/yyy/data_3_gb/0-ccc89437-58a8-44a4-aad2-17ffce7dd929-0.parquet: response error "<html><body><h1>429 Too Many Requests</h1>
You have sent too many requests in a given amount of time.
</body></html>
", after 0 retries: HTTP status client error (429 Too Many Requests) for url (https://xxx.yyy.zzz.net/delta-lake-bronze/xxx/yyy/data_3_gb/0-ccc89437-58a8-44a4-aad2-17ffce7dd929-0.parquet)

How to reproduce it:
My Code:

import deltalake as dl

storage_options = {"AWS_ACCESS_KEY_ID": f"{credentials.access_key}",
                   "AWS_SECRET_ACCESS_KEY": f"{credentials.secret_key}",
                   "AWS_ENDPOINT_URL": "https://xxx.yyy.zzz.net",
                   "AWS_S3_ALLOW_UNSAFE_RENAME": "True",
                  }

table_uri = "s3a://delta-lake-bronze/xxx/yyy/data_3_gb"
dt = dl.DeltaTable(table_uri=table_uri, storage_options=storage_options)

# Converting to pandas fails
dt.to_pandas()

# Converting from pyarrow dataset to table fails as well
dataset = dt.to_pyarrow_dataset()
dataset.to_table()

More details:
I am not sure if this information helps, but I get the same error when reading with Polars.

@shazamkash shazamkash added the bug Something isn't working label Mar 31, 2023
@roeap
Collaborator

roeap commented Apr 1, 2023

@shazamkash - Thanks for reporting this!

From the response you showed it seems like we are running into some sort of throttling on the storage side, though I'm not quite sure why. Could you see what happens if you configure the pyarrow S3 filesystem and pass that to to_pyarrow_dataset? https://delta-io.github.io/delta-rs/python/usage.html#custom-storage-backends.

@shazamkash
Author

@roeap

I tried what you suggested; please find the code and errors below:

Code:

from pyarrow import fs
import deltalake as dl

storage_options = {"AWS_ACCESS_KEY_ID": f"{credentials.access_key}", 
                   "AWS_SECRET_ACCESS_KEY": f"{credentials.secret_key}",
                   "AWS_ENDPOINT_URL": "https://xxx.yyy.zzz.net",
                   "AWS_S3_ALLOW_UNSAFE_RENAME": "True",
                  }

table_uri = "s3a://delta-lake-bronze/xxx/yyy/data_3_gb"
dt = dl.DeltaTable(table_uri=table_uri, storage_options=storage_options)

s3 = fs.S3FileSystem(access_key=f"{credentials.access_key}",
                     secret_key=f"{credentials.secret_key}",
                     endpoint_override="https://xxx.yyy.zzz.net")

# Fails
dataset = dt.to_pyarrow_dataset(filesystem=s3)

# Fails as well
df = dt.to_pandas(filesystem=s3)

Error from dt.to_pyarrow_dataset()

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[28], line 2
      1 dataset = dt.to_pyarrow_dataset(filesystem=s3)
----> 2 dataset.to_table()

File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:369, in pyarrow._dataset.Dataset.to_table()

File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:2818, in pyarrow._dataset.Scanner.to_table()

File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:115, in pyarrow.lib.check_status()

OSError: Not a regular file: '0-ccc89437-58a8-44a4-aad2-17ffce7dd929-0.parquet'

Error from dt.to_pandas()

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[26], line 1
----> 1 dt.to_pandas(filesystem=s3)

File /opt/conda/lib/python3.10/site-packages/deltalake/table.py:418, in DeltaTable.to_pandas(self, partitions, columns, filesystem)
    404 def to_pandas(
    405     self,
    406     partitions: Optional[List[Tuple[str, str, Any]]] = None,
    407     columns: Optional[List[str]] = None,
    408     filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
    409 ) -> "pandas.DataFrame":
    410     """
    411     Build a pandas dataframe using data from the DeltaTable.
    412 
   (...)
    416     :return: a pandas dataframe
    417     """
--> 418     return self.to_pyarrow_table(
    419         partitions=partitions, columns=columns, filesystem=filesystem
    420     ).to_pandas()

File /opt/conda/lib/python3.10/site-packages/deltalake/table.py:400, in DeltaTable.to_pyarrow_table(self, partitions, columns, filesystem)
    386 def to_pyarrow_table(
    387     self,
    388     partitions: Optional[List[Tuple[str, str, Any]]] = None,
    389     columns: Optional[List[str]] = None,
    390     filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
    391 ) -> pyarrow.Table:
    392     """
    393     Build a PyArrow Table using data from the DeltaTable.
    394 
   (...)
    398     :return: the PyArrow table
    399     """
--> 400     return self.to_pyarrow_dataset(
    401         partitions=partitions, filesystem=filesystem
    402     ).to_table(columns=columns)

File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:369, in pyarrow._dataset.Dataset.to_table()

File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:2818, in pyarrow._dataset.Scanner.to_table()

File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:115, in pyarrow.lib.check_status()

OSError: Not a regular file: '0-ccc89437-58a8-44a4-aad2-17ffce7dd929-0.parquet'

Here is the list of files, which I can get by running the following code (this part works):

dataset = dt.to_pyarrow_dataset(filesystem=s3)
dataset.files

List of files:

['0-ccc89437-58a8-44a4-aad2-17ffce7dd929-0.parquet',
 '0-ccc89437-58a8-44a4-aad2-17ffce7dd929-1.parquet',
 '0-ccc89437-58a8-44a4-aad2-17ffce7dd929-3.parquet',
 '0-ccc89437-58a8-44a4-aad2-17ffce7dd929-2.parquet']

@shazamkash
Author

shazamkash commented Apr 5, 2023

@roeap

Another thing I noticed: this only happens with data that is "big" (a few hundred MB to a few GB) and split into multiple parquet files. I can read tables that are very small (tens of MB) and saved in a single file.

Any help would be appreciated. I have read the same data before with an older delta-rs version and it worked fine back then; unfortunately I no longer remember the exact version.

Also here is the full error which I was able to get now:

---------------------------------------------------------------------------
PyDeltaTableError                         Traceback (most recent call last)
Cell In[6], line 1
----> 1 dt.to_pandas()

File /opt/conda/lib/python3.10/site-packages/deltalake/table.py:418, in DeltaTable.to_pandas(self, partitions, columns, filesystem)
    404 def to_pandas(
    405     self,
    406     partitions: Optional[List[Tuple[str, str, Any]]] = None,
    407     columns: Optional[List[str]] = None,
    408     filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
    409 ) -> "pandas.DataFrame":
    410     """
    411     Build a pandas dataframe using data from the DeltaTable.
    412 
   (...)
    416     :return: a pandas dataframe
    417     """
--> 418     return self.to_pyarrow_table(
    419         partitions=partitions, columns=columns, filesystem=filesystem
    420     ).to_pandas()

File /opt/conda/lib/python3.10/site-packages/deltalake/table.py:400, in DeltaTable.to_pyarrow_table(self, partitions, columns, filesystem)
    386 def to_pyarrow_table(
    387     self,
    388     partitions: Optional[List[Tuple[str, str, Any]]] = None,
    389     columns: Optional[List[str]] = None,
    390     filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
    391 ) -> pyarrow.Table:
    392     """
    393     Build a PyArrow Table using data from the DeltaTable.
    394 
   (...)
    398     :return: the PyArrow table
    399     """
--> 400     return self.to_pyarrow_dataset(
    401         partitions=partitions, filesystem=filesystem
    402     ).to_table(columns=columns)

File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:369, in pyarrow._dataset.Dataset.to_table()

File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:2818, in pyarrow._dataset.Scanner.to_table()

File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

PyDeltaTableError: Generic S3 error: Error performing get request xxx/yyy/data_3_gb/0-ccc89437-58a8-44a4-aad2-17ffce7dd929-0.parquet: response error "<html><body><h1>429 Too Many Requests</h1>
You have sent too many requests in a given amount of time.
</body></html>
", after 0 retries: HTTP status client error (429 Too Many Requests) for url (https://xxx.yyy.zzz.net/delta-lake-bronze/xxx/yyy/data_3_gb/0-ccc89437-58a8-44a4-aad2-17ffce7dd929-0.parquet)

@tsafacjo

tsafacjo commented Nov 1, 2023

Can I take it?

@roeap
Collaborator

roeap commented Nov 9, 2023

@tsafacjo - certainly :)
