Fix: Missing a MetadataConfigs init when the repo has a `datasets_info.json` but no README #6164

clefourrier · 2023-08-21T14:57:54Z

When I try to push to an arrow repo (can provide the link on Slack), it uploads the files but fails to update the metadata, with the following error:

  File "app.py", line 123, in add_new_eval
    eval_results[level].push_to_hub(my_repo, token=TOKEN, split=SPLIT)
  File "blabla_my_env_path/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 5501, in push_to_hub
    if not metadata_configs:
UnboundLocalError: local variable 'metadata_configs' referenced before assignment

This fixes it.
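For readers unfamiliar with this failure mode, here is a minimal sketch of how an `UnboundLocalError` like the one above can arise and how initializing the variable in every branch avoids it. This is not the actual code in `arrow_dataset.py`: the stand-in `MetadataConfigs` class, the two function names, and the branch layout are assumptions made only for illustration, and the file name is taken as written in the PR title.

```python
class MetadataConfigs(dict):
    """Stand-in for the real class, only here so the sketch runs on its own."""


def load_metadata_configs_buggy(repo_files):
    # Hypothetical branch layout: one branch per kind of repo content.
    if "README.md" in repo_files:
        metadata_configs = MetadataConfigs(default={"from": "README.md"})
    elif "datasets_info.json" in repo_files:
        pass  # this branch never binds metadata_configs
    else:
        metadata_configs = MetadataConfigs()
    # Raises UnboundLocalError when the middle branch was taken.
    if not metadata_configs:
        metadata_configs = MetadataConfigs(default={"from": "generated"})
    return metadata_configs


def load_metadata_configs_fixed(repo_files):
    if "README.md" in repo_files:
        metadata_configs = MetadataConfigs(default={"from": "README.md"})
    elif "datasets_info.json" in repo_files:
        # The missing init: bind an empty MetadataConfigs here as well.
        metadata_configs = MetadataConfigs()
    else:
        metadata_configs = MetadataConfigs()
    if not metadata_configs:
        metadata_configs = MetadataConfigs(default={"from": "generated"})
    return metadata_configs
```

Calling `load_metadata_configs_buggy(["datasets_info.json"])` raises `UnboundLocalError`, while the fixed variant returns a populated object for the same input.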

HuggingFaceDocBuilderDev · 2023-08-21T15:04:24Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-08-21T15:04:41Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006874 / 0.011353 (-0.004479)	0.004276 / 0.011008 (-0.006732)	0.085198 / 0.038508 (0.046690)	0.084281 / 0.023109 (0.061171)	0.344767 / 0.275898 (0.068869)	0.377798 / 0.323480 (0.054318)	0.005656 / 0.007986 (-0.002330)	0.003601 / 0.004328 (-0.000727)	0.065486 / 0.004250 (0.061235)	0.056191 / 0.037052 (0.019139)	0.351412 / 0.258489 (0.092923)	0.398591 / 0.293841 (0.104750)	0.031662 / 0.128546 (-0.096884)	0.008901 / 0.075646 (-0.066745)	0.290423 / 0.419271 (-0.128849)	0.053793 / 0.043533 (0.010260)	0.347968 / 0.255139 (0.092829)	0.376978 / 0.283200 (0.093778)	0.026745 / 0.141683 (-0.114938)	1.514119 / 1.452155 (0.061964)	1.580920 / 1.492716 (0.088203)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.273648 / 0.018006 (0.255642)	0.575176 / 0.000490 (0.574686)	0.003557 / 0.000200 (0.003357)	0.000093 / 0.000054 (0.000038)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031714 / 0.037411 (-0.005697)	0.089166 / 0.014526 (0.074640)	0.101525 / 0.176557 (-0.075032)	0.161855 / 0.737135 (-0.575281)	0.101391 / 0.296338 (-0.194947)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.380947 / 0.215209 (0.165738)	3.800527 / 2.077655 (1.722873)	1.820789 / 1.504120 (0.316669)	1.657327 / 1.541195 (0.116132)	1.776242 / 1.468490 (0.307752)	0.486954 / 4.584777 (-4.097823)	3.688340 / 3.745712 (-0.057372)	3.354453 / 5.269862 (-1.915409)	2.119995 / 4.565676 (-2.445682)	0.057446 / 0.424275 (-0.366829)	0.007752 / 0.007607 (0.000145)	0.461907 / 0.226044 (0.235862)	4.617870 / 2.268929 (2.348942)	2.337025 / 55.444624 (-53.107599)	1.964770 / 6.876477 (-4.911707)	2.252066 / 2.142072 (0.109993)	0.591585 / 4.805227 (-4.213642)	0.134655 / 6.500664 (-6.366009)	0.060646 / 0.075469 (-0.014823)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.263271 / 1.841788 (-0.578517)	20.822286 / 8.074308 (12.747978)	14.710256 / 10.191392 (4.518864)	0.167285 / 0.680424 (-0.513139)	0.018302 / 0.534201 (-0.515899)	0.401023 / 0.579283 (-0.178260)	0.428956 / 0.434364 (-0.005407)	0.466120 / 0.540337 (-0.074218)	0.637868 / 1.386936 (-0.749069)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007174 / 0.011353 (-0.004179)	0.004418 / 0.011008 (-0.006590)	0.065731 / 0.038508 (0.027223)	0.090457 / 0.023109 (0.067348)	0.387306 / 0.275898 (0.111408)	0.427178 / 0.323480 (0.103698)	0.005699 / 0.007986 (-0.002286)	0.003662 / 0.004328 (-0.000666)	0.066190 / 0.004250 (0.061940)	0.062860 / 0.037052 (0.025808)	0.388855 / 0.258489 (0.130366)	0.427853 / 0.293841 (0.134012)	0.032770 / 0.128546 (-0.095776)	0.008780 / 0.075646 (-0.066866)	0.071156 / 0.419271 (-0.348116)	0.050174 / 0.043533 (0.006641)	0.385254 / 0.255139 (0.130115)	0.405069 / 0.283200 (0.121869)	0.025561 / 0.141683 (-0.116122)	1.506907 / 1.452155 (0.054752)	1.543270 / 1.492716 (0.050554)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.304651 / 0.018006 (0.286645)	0.577269 / 0.000490 (0.576780)	0.004479 / 0.000200 (0.004279)	0.000127 / 0.000054 (0.000073)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.034070 / 0.037411 (-0.003341)	0.097664 / 0.014526 (0.083138)	0.106969 / 0.176557 (-0.069588)	0.163093 / 0.737135 (-0.574043)	0.109384 / 0.296338 (-0.186955)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.414823 / 0.215209 (0.199614)	4.148390 / 2.077655 (2.070735)	2.114038 / 1.504120 (0.609918)	1.959316 / 1.541195 (0.418121)	2.098138 / 1.468490 (0.629648)	0.486338 / 4.584777 (-4.098439)	3.642850 / 3.745712 (-0.102863)	3.458311 / 5.269862 (-1.811551)	2.185662 / 4.565676 (-2.380014)	0.057555 / 0.424275 (-0.366720)	0.007522 / 0.007607 (-0.000085)	0.497975 / 0.226044 (0.271931)	4.971528 / 2.268929 (2.702600)	2.614087 / 55.444624 (-52.830537)	2.288406 / 6.876477 (-4.588070)	2.564067 / 2.142072 (0.421995)	0.582248 / 4.805227 (-4.222979)	0.134931 / 6.500664 (-6.365733)	0.062689 / 0.075469 (-0.012780)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.343331 / 1.841788 (-0.498457)	21.398950 / 8.074308 (13.324642)	14.620971 / 10.191392 (4.429579)	0.169779 / 0.680424 (-0.510644)	0.018683 / 0.534201 (-0.515518)	0.396152 / 0.579283 (-0.183131)	0.409596 / 0.434364 (-0.024768)	0.482875 / 0.540337 (-0.057463)	0.659977 / 1.386936 (-0.726959)

lhoestq

Good catch!

github-actions · 2023-08-21T16:27:05Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006662 / 0.011353 (-0.004691)	0.003959 / 0.011008 (-0.007049)	0.084447 / 0.038508 (0.045939)	0.070267 / 0.023109 (0.047158)	0.310301 / 0.275898 (0.034403)	0.339866 / 0.323480 (0.016386)	0.004008 / 0.007986 (-0.003977)	0.003270 / 0.004328 (-0.001058)	0.064997 / 0.004250 (0.060746)	0.053151 / 0.037052 (0.016099)	0.327867 / 0.258489 (0.069378)	0.368560 / 0.293841 (0.074719)	0.031436 / 0.128546 (-0.097111)	0.008547 / 0.075646 (-0.067099)	0.288513 / 0.419271 (-0.130758)	0.051833 / 0.043533 (0.008300)	0.312660 / 0.255139 (0.057521)	0.347180 / 0.283200 (0.063980)	0.024982 / 0.141683 (-0.116701)	1.472487 / 1.452155 (0.020333)	1.550138 / 1.492716 (0.057422)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.208443 / 0.018006 (0.190437)	0.451927 / 0.000490 (0.451437)	0.004452 / 0.000200 (0.004252)	0.000082 / 0.000054 (0.000027)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029164 / 0.037411 (-0.008247)	0.085801 / 0.014526 (0.071275)	0.096229 / 0.176557 (-0.080327)	0.153063 / 0.737135 (-0.584072)	0.097712 / 0.296338 (-0.198626)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.383969 / 0.215209 (0.168760)	3.829216 / 2.077655 (1.751561)	1.854466 / 1.504120 (0.350346)	1.684149 / 1.541195 (0.142954)	1.759422 / 1.468490 (0.290932)	0.480229 / 4.584777 (-4.104548)	3.653363 / 3.745712 (-0.092349)	3.264456 / 5.269862 (-2.005406)	2.020579 / 4.565676 (-2.545097)	0.056920 / 0.424275 (-0.367355)	0.007625 / 0.007607 (0.000018)	0.458559 / 0.226044 (0.232515)	4.580288 / 2.268929 (2.311359)	2.353783 / 55.444624 (-53.090841)	1.967223 / 6.876477 (-4.909253)	2.182707 / 2.142072 (0.040634)	0.631341 / 4.805227 (-4.173886)	0.141656 / 6.500664 (-6.359008)	0.059918 / 0.075469 (-0.015551)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.279635 / 1.841788 (-0.562153)	19.725763 / 8.074308 (11.651455)	14.477946 / 10.191392 (4.286554)	0.164360 / 0.680424 (-0.516064)	0.018286 / 0.534201 (-0.515915)	0.394935 / 0.579283 (-0.184348)	0.419638 / 0.434364 (-0.014726)	0.460366 / 0.540337 (-0.079972)	0.636876 / 1.386936 (-0.750060)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006568 / 0.011353 (-0.004785)	0.004270 / 0.011008 (-0.006738)	0.065522 / 0.038508 (0.027014)	0.071597 / 0.023109 (0.048487)	0.394929 / 0.275898 (0.119031)	0.427548 / 0.323480 (0.104068)	0.005320 / 0.007986 (-0.002665)	0.003366 / 0.004328 (-0.000962)	0.065780 / 0.004250 (0.061530)	0.055390 / 0.037052 (0.018338)	0.397950 / 0.258489 (0.139461)	0.435800 / 0.293841 (0.141959)	0.031816 / 0.128546 (-0.096730)	0.008555 / 0.075646 (-0.067091)	0.072110 / 0.419271 (-0.347161)	0.049077 / 0.043533 (0.005544)	0.390065 / 0.255139 (0.134926)	0.410294 / 0.283200 (0.127094)	0.023389 / 0.141683 (-0.118294)	1.491491 / 1.452155 (0.039336)	1.551057 / 1.492716 (0.058341)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.243869 / 0.018006 (0.225862)	0.451961 / 0.000490 (0.451471)	0.019834 / 0.000200 (0.019634)	0.000114 / 0.000054 (0.000059)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031031 / 0.037411 (-0.006380)	0.088189 / 0.014526 (0.073663)	0.101743 / 0.176557 (-0.074814)	0.155236 / 0.737135 (-0.581899)	0.101245 / 0.296338 (-0.195094)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.422178 / 0.215209 (0.206969)	4.199989 / 2.077655 (2.122334)	2.228816 / 1.504120 (0.724696)	2.057172 / 1.541195 (0.515978)	2.162651 / 1.468490 (0.694161)	0.491186 / 4.584777 (-4.093591)	3.666221 / 3.745712 (-0.079491)	3.289531 / 5.269862 (-1.980331)	2.050027 / 4.565676 (-2.515650)	0.057464 / 0.424275 (-0.366811)	0.007379 / 0.007607 (-0.000228)	0.506532 / 0.226044 (0.280487)	5.066385 / 2.268929 (2.797456)	2.694405 / 55.444624 (-52.750219)	2.372200 / 6.876477 (-4.504277)	2.562724 / 2.142072 (0.420652)	0.615474 / 4.805227 (-4.189753)	0.148284 / 6.500664 (-6.352380)	0.061380 / 0.075469 (-0.014089)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.332649 / 1.841788 (-0.509139)	20.591063 / 8.074308 (12.516755)	14.105253 / 10.191392 (3.913861)	0.151886 / 0.680424 (-0.528537)	0.018200 / 0.534201 (-0.516001)	0.395278 / 0.579283 (-0.184005)	0.407113 / 0.434364 (-0.027251)	0.473168 / 0.540337 (-0.067170)	0.660766 / 1.386936 (-0.726170)

Commit 1fd2234: MetadataConfigs not initialized when the repo has a `datasets_info.json` but no README

clefourrier requested a review from polinaeterna August 21, 2023 15:03

clefourrier requested a review from lhoestq August 21, 2023 15:40

lhoestq approved these changes Aug 21, 2023

lhoestq merged commit 8b8e6ee into main Aug 21, 2023
10 of 13 checks passed

lhoestq deleted the clefourrier-patch-1 branch August 21, 2023 16:18

albertvillanova pushed a commit that referenced this pull request Oct 24, 2023

Commit 59ce85c: Fix: Missing a MetadataConfigs init when the repo has a `datasets_info.json` but no README (#6164)

MetadataConfigs not initialized when the repo has a `datasets_info.json` but no README
