Releases: aws/aws-sdk-pandas

AWS SDK for pandas 2.17.0

20 Sep 23:11
3bcd8d3

Enhancements

  • Returning empty DataFrame for empty TimeStream query #1430
  • Added support for INSERT IGNORE for mysql.to_sql #1429
  • Added use_column_names to redshift.copy akin to redshift.to_sql #1437
  • Enable passing kwargs to redshift.connect #1467
  • Add timestream_endpoint_url property to the config #1483
  • Add support for upserting to an empty Glue table #1579

Documentation

  • Fix typos in documentation #1434

Bug Fix

  • validate_schema=True for wr.s3.read_parquet breaks with partition columns and dataset=True #1426
  • wr.neptune.to_property_graph failing for Neptune version 1.1.1.0 #1407
  • ValueError when using opensearch.index_df with documents with an array field #1444
  • Missing catalog_id in wr.catalog.create_database #1480
  • Check for pair of brackets in query preparation for Athena cache #1529
  • Fix wrong type hint for TagColumnOperation in quicksight.create_athena_dataset #1570
  • s3.to_json compression parameters is passed twice when dataset=True #1585
  • Cast Athena array, map & struct types to pandas object #1581
  • In the OpenSearch module, use SSL only for HTTPS (port 443) #1603
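One practical consequence of the nested-type fix above (#1581) is that values read from Athena `array`, `map`, and `struct` columns now land in pandas columns of `object` dtype, holding native Python lists and dicts. A minimal local sketch of what such a frame looks like (plain pandas, no AWS call involved):

```python
import pandas as pd

# Columns holding native Python lists and dicts carry the generic
# "object" dtype, mirroring how nested Athena types are now surfaced.
df = pd.DataFrame({
    "tags": [["a", "b"], ["c"]],      # e.g. an Athena array<string> column
    "attrs": [{"k": 1}, {"k": 2}],    # e.g. an Athena map<string,int> column
})
print(df["tags"].dtype, df["attrs"].dtype)  # object object
```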

Noteworthy

AWS Lambda Managed Layers

Since the last release, the library has been accepted as an official AWS SDK and rebranded as AWS SDK for pandas 🚀. The Python module names remain the same. One noteworthy change, however, is that the AWS Lambda managed layer has been renamed from AWSDataWrangler to AWSSDKPandas.

You can view the ARN value for the layers here.

PyArrow 7 Support

⚠️ For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):

➡️ pip install pyarrow==2 awswrangler

Thanks

We thank the following contributors/users for their work on this release:

@bechbd, @maxispeicher, @timgates42, @aeeladawy, @KhueNgocDang, @szemek, @malachi-constant, @cnfait, @jaidisido, @LeonLuttenberger, @kukushking

3.0.0a2

17 Aug 10:35
b471c5c
Pre-release

This is a pre-release for the Wrangler@Scale project

What's Changed

Full Changelog: 3.0.0a1...3.0.0a2

3.0.0a1

17 Aug 10:06
b4d13bf
Pre-release

This is a pre-release for the Wrangler@Scale project

What's Changed

  • (feat): Add distributed config flag and initialise method by @jaidisido in #1389
  • (feat): Add distributed Lake Formation read by @jaidisido in #1397
  • (feat): Distribute S3 select over multiple paths and scan ranges by @jaidisido in #1445
  • (refactor): Refactor threading/ray; add single-path distributed s3 select impl by @kukushking in #1446

Full Changelog: 2.16.1...3.0.0a1

2.16.1

28 Jun 16:39

Noteworthy

🐛 Fixed issue introduced by 2.16.0 to method s3.read_parquet()

Patch

  • Fix bug: pq_file.schema.names(): TypeError: 'list' object is not callable s3.read_parquet() #1412

P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload and run them, or use them from our public S3 bucket!

Full Changelog: 2.16.0...2.16.1

AWS Data Wrangler 2.16.0

22 Jun 18:21

Noteworthy

⚠️ For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):

➡️ pip install pyarrow==2 awswrangler

New Functionalities

  • Add support for Oracle Database 🔥 #1259. Check out the tutorial.

Enhancements

  • add test infrastructure for oracle database #1274
  • revisiting S3 Select performance #1287
  • migrate test infra from cdk v1 to cdk v2 #1288
  • to_sql() make column names quoted identifiers to allow sql keywords #1392
  • throw NoFilesFound exception on 404 #1290
  • fast executemany #1299
  • add precombine key to upsert method for Redshift #1304
  • pass precombine to redshift.copy() #1319
  • use DataFrame column names in INSERT statement for UPSERT operation #1317
  • add data_source param to athena.repair_table #1324
  • modify athena2quicksight datatypes to allow startswith for varchar #1332
  • add TagColumnOperation to quicksight.create_athena_dataset #1342
  • enable list timestream databases and tables #1345
  • enable s3.to_parquet to receive "zstd" compression type #1369
  • create a way to perform PartiQL queries to a Dynamo DB table #1390
  • s3 proxy support with data wrangler #1361
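The quoted-identifier change (#1392) means generated SQL wraps column names in quotes, so reserved words such as `order` or `select` can be used as column names. A hypothetical sketch of the idea, not the library's actual implementation:

```python
# Hypothetical sketch: quote identifiers so reserved words are safe in
# a generated INSERT statement. Embedded quote characters are doubled,
# per standard SQL escaping.
def quote_identifier(name: str, quote: str = '"') -> str:
    return quote + name.replace(quote, quote * 2) + quote

columns = ["id", "order", "select"]
insert = "INSERT INTO t ({}) VALUES ({})".format(
    ", ".join(quote_identifier(c) for c in columns),
    ", ".join(["%s"] * len(columns)),
)
print(insert)  # INSERT INTO t ("id", "order", "select") VALUES (%s, %s, %s)
```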

Documentation

  • be more explicit about awswrangler.s3.to_parquet overwrite behavior #1300
  • fix Python Version in Readme #1302

Bug Fix

  • set encoding to utf-8 when no encoding is specified when reading/writing to s3 #1257
  • fix Redshift Locking Behavior #1305
  • specify cfn deletion policy for sqlserver and oracle instances #1378
  • to_sql() make column names quoted identifiers to allow sql keywords #1392
  • fix extension dtype index handling #1333
  • fix issue with redshift.to_sql() method when mode set to "upsert" and schema contains a hyphen #1360
  • timestream - array cols to str #1368
  • read_parquet Does Not Throw Error for Missing Column #1370

Thanks

We thank the following contributors/users for their work on this release:

@bnimam, @IldarAlmakaev, @syokoysn, @thomasniebler, @maxdavidson91, @takeknock, @Sleekbobby1011, @snikolakis, @willsmith28, @malachi-constant, @cnfait, @jaidisido, @kukushking


P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload and run them, or use them from our public S3 bucket!

AWS Data Wrangler 2.15.1

11 Apr 15:35
7708c80

Noteworthy

⚠️ Dropped Python 3.6 support

⚠️ For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):

➡️ pip install pyarrow==2 awswrangler

Patch

  • Add sparql extra & make SPARQLWrapper dependency optional #1252

P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload and run them, or use them from our public S3 bucket!

AWS Data Wrangler 2.15.0

28 Mar 14:36

Noteworthy

⚠️ Dropped Python 3.6 support

⚠️ For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):

➡️ pip install pyarrow==2 awswrangler

Enhancements

  • Timestream module - support multi-measure records #1214
  • Warnings for implicit float conversion of nulls in to_parquet #1221
  • Support additional sql params in Redshift COPY operation #1210
  • Add create_ctas_table to Athena module #1207
  • S3 Proxy support #1206
  • Add Athena get_named_query_statement #1183
  • Add manifest parameter to 'redshift.copy_from_files' method #1164
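The null-conversion warning above (#1221) concerns a pandas behavior worth knowing: a plain integer column that contains a null is silently upcast to float64, which then changes what gets written to Parquet. A quick local demonstration (plain pandas, no awswrangler call):

```python
import pandas as pd

# A plain integer column containing a null is silently upcast to
# float64 by pandas -- the implicit conversion the new warning flags.
upcast = pd.Series([1, 2, None])
print(upcast.dtype)  # float64

# The nullable Int64 extension dtype keeps the integers intact instead.
kept = pd.Series([1, 2, None], dtype="Int64")
print(kept.dtype)  # Int64
```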

Documentation

  • Update install section #1242
  • Update lambda layers section #1236

Bug Fix

  • Give precedence to user path for Athena UNLOAD S3 Output Location #1216
  • Honor User specified workgroup in athena.read_sql_query with unload_approach=True #1178
  • Support map type in Redshift copy #1185
  • data_api.rds.read_sql_query() does not preserve data type when column is all NULLS - switches to Boolean #1158
  • Allow decimal values within struct when writing to parquet #1179

Thanks

We thank the following contributors/users for their work on this release:

@bechbd, @sakti-mishra, @mateogianolio, @jasadams, @malachi-constant, @cnfait, @jaidisido, @kukushking


P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload and run them, or use them from our public S3 bucket!

AWS Data Wrangler 2.14.0

28 Jan 14:24
7604507

Caveats

⚠️ For platforms without PyArrow 6 support (e.g. MWAA, EMR, Glue PySpark Job):

➡️ pip install pyarrow==2 awswrangler

New Functionalities

  • Support Athena Unload 🚀 #1038

Enhancements

  • Add the ExcludeColumnSchema=True argument to the glue.get_partitions call to reduce response size #1094
  • Add PyArrow flavor argument to write_parquet via pyarrow_additional_kwargs #1057
  • Add rename_duplicate_columns and handle_duplicate_columns flag to sanitize_dataframe_columns_names method #1124
  • Add timestamp_as_object argument to all database read_sql_table methods #1130
  • Add ignore_null to read_parquet_metadata method #1125
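The `timestamp_as_object` argument above exists because pandas' default `datetime64[ns]` dtype cannot represent dates much past the year 2262; keeping values as plain Python `datetime` objects sidesteps that limit. A minimal local illustration of the object-dtype representation (plain pandas, no database call):

```python
import datetime as dt
import pandas as pd

# Far-future dates overflow datetime64[ns]; storing them with dtype
# "object" keeps them as plain Python datetime values instead, which is
# the representation timestamp_as_object selects.
far_future = dt.datetime(9999, 12, 31)
s = pd.Series([far_future], dtype="object")
print(s.iloc[0].year)  # 9999
```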

Documentation

  • Improve documentation on installing SAR Lambda layers with the CDK #1097
  • Fix broken link to tutorial in to_parquet method #1058

Bug Fix

  • Ensure that partition locations retrieved from AWS Glue always end in a "/" #1094
  • Fix bucketing overflow issue in Athena #1086

Thanks

We thank the following contributors/users for their work on this release:

@dennyau, @kailukowiak, @lucasmo, @moykeen, @RigoIce, @vlieven, @kepler, @mdavis-xyz, @ConstantinoSchillebeeckx, @kukushking, @jaidisido


P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload and run them, or use them from our public S3 bucket!

AWS Data Wrangler 2.13.0

03 Dec 20:09

Caveats

⚠️ For platforms without PyArrow 6 support (e.g. MWAA, EMR, Glue PySpark Job):

➡️ pip install pyarrow==2 awswrangler

Breaking changes

  • Fix sanitize methods to align with Glue/Hive naming conventions #579
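Glue/Hive naming conventions expect lower-case names restricted to letters, digits, and underscores. A hypothetical sketch of that style of sanitization, for illustration only (the library's actual rules may differ in detail):

```python
import re

# Hypothetical sketch of Glue/Hive-style name sanitization: lower-case
# the name and replace runs of characters outside [a-z0-9_] with "_".
def sanitize_name(name: str) -> str:
    return re.sub(r"[^a-z0-9_]+", "_", name.lower())

print(sanitize_name("Camel Case-Column"))  # camel_case_column
```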

New Functionalities

  • AWS Lake Formation Governed Tables 🚀 #570
  • Support for Python 3.10 🔥 #973
  • Add partitioning to JSON datasets #962
  • Add ability to use unbuffered cursor for large MySQL datasets #928

Enhancements

  • Add awswrangler.s3.list_buckets #997
  • Add partitions_parameters to catalog partitions methods #1035
  • Refactor pagination config in list objects #955
  • Add error message to EmptyDataframe exception #991

Documentation

  • Clarify docs & add tutorial on schema evolution for CSV datasets #964

Bug Fix

  • catalog.add_column() without column_comment triggers exception #1017
  • catalog.create_parquet_table Key in dictionary does not always exist #998
  • Fix Catalog StorageDescriptor get #969

Thanks

We thank the following contributors/users for their work on this release:

@csabz09, @Falydoor, @moritzkoerber, @maxispeicher, @kukushking, @jaidisido


P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload and run them, or use them from our public S3 bucket!

AWS Data Wrangler 2.12.1

18 Oct 12:02
829c306

Caveats

⚠️ For platforms without PyArrow 5 support (e.g. MWAA, EMR, Glue PySpark Job):

➡️ pip install pyarrow==2 awswrangler

Patch

  • Removing unnecessary dev dependencies from main #961

P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload and run them, or use them from our public S3 bucket!