Deprecate download_custom #6093

Merged
mariosasko merged 3 commits into main on Jul 28, 2023

Conversation

mariosasko (Collaborator)

Deprecate DownloadManager.download_custom. Users should use fsspec URLs (cacheable) or make direct requests with fsspec/requests (not cacheable) instead.

We should deprecate this method because it is not compatible with streaming, and implementing a streaming version of it is hard, if not impossible. There have been requests on the forum to support streaming for this method, but they seem to stem from a tip in the docs that "promotes" it (this PR also removes that tip).
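
For reference, a minimal sketch of the recommended replacements, written as a standalone helper (the example.com URLs and the `download_resources` name are hypothetical, not part of this PR):

```python
# Hedged sketch of the alternatives named in the deprecation message; the URLs
# and this helper are illustrative only.
import fsspec
import requests


def download_resources(dl_manager):
    # Cacheable: pass a plain or chained fsspec URL to the download manager,
    # which caches the file (and resolves the same call lazily in streaming mode).
    archive_path = dl_manager.download("https://example.com/data/archive.zip")

    # Not cacheable: open the resource directly with fsspec ...
    with fsspec.open("https://example.com/data/extra.json", "rt") as f:
        extra = f.read()

    # ... or fetch it with requests.
    resp = requests.get("https://example.com/data/metadata.json", timeout=30)
    resp.raise_for_status()
    return archive_path, extra, resp.json()
```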

@lhoestq (Member) left a comment

good idea !

@HuggingFaceDocBuilderDev commented Jul 28, 2023

The documentation is not available anymore as the PR was closed or merged.

@github-actions

Show benchmarks (automated CI results for PyArrow==8.0.0 and PyArrow==latest: benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, benchmark_map_filter).

@github-actions

Show benchmarks (automated CI results for PyArrow==8.0.0 and PyArrow==latest: benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, benchmark_map_filter).

@mariosasko (Collaborator, Author)

I forgot to mention this in the initial comment, but only one public dataset (excluding gated) uses this method - pg19, which I just fixed.

mariosasko merged commit 50d9a70 into main on Jul 28, 2023 (13 checks passed).
mariosasko deleted the deprecate-custom-download branch on July 28, 2023 at 11:30.
@github-actions

Show benchmarks (automated CI results for PyArrow==8.0.0 and PyArrow==latest: benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, benchmark_map_filter).

@ProgramComputer commented Aug 20, 2023

@mariosasko How would you stream a split zip file with just download_and_extract or download? With download_custom, it is possible to combine a split zip file. Perhaps add an option in download to combine split zips. This issue may apply to other multipart file types.

Edit: in case you're wondering why I use split zips, I haven't been able to upload zips larger than 50 GB to HuggingFace.

Edit 2: the issue has since been addressed for split zips.
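
For reference, one way to approximate this without download_custom is to download the parts and reassemble them locally. A rough, non-streaming sketch, assuming the parts are a plain byte split of a single zip (for example produced with the `split` command) rather than a `zip -s` multi-part archive; the URLs, part count, and helper name are hypothetical:

```python
# Non-streaming sketch: download the byte-split parts, concatenate them back
# into one zip, then extract it. URLs and part count are made up for illustration.
import shutil


def combine_split_zip(dl_manager):
    part_urls = [f"https://example.com/data.zip.{i:03d}" for i in range(3)]
    local_parts = dl_manager.download(part_urls)  # list of cached local paths

    combined_path = local_parts[0] + ".combined.zip"
    with open(combined_path, "wb") as out:
        for part in local_parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # append each part in order

    return dl_manager.extract(combined_path)
```

This keeps caching but not streaming, which is the gap the comment above points out.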

albertvillanova pushed a commit that referenced this pull request Oct 24, 2023
* Deprecate `download_custom`

* Better msg