Handling larger clusters #13

rjpower · 2014-02-05T18:36:18Z

When arrays have more than a few hundred tiles, I've noticed that our performance drops significantly; this is almost certainly due to the various extent operations needed to compute tiles. Moving the extent code to Cython should give us a big speedup.

Also, the vast majority of arrays have tiles that are all the same shape; we can leverage this to avoid scanning a tile list, and instead use the tile shape to find the target tile, e.g.

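def pos_to_tile(pos, tile_shape, array):
  # Tile coordinates along each axis (integer division).
  tx = pos[0] // tile_shape[0]
  ty = pos[1] // tile_shape[1]
  ...
  # Number of tiles along axis 0; assumes the shape divides evenly.
  num_tiles_x = array.shape[0] // tile_shape[0]
  return ty * num_tiles_x + tx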
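
To make the idea concrete, here is a rough, self-contained sketch that compares a linear scan over a tile list with the arithmetic lookup. The names (`tile_grid`, `make_tiles`, `scan_for_tile`) are hypothetical and not taken from extent.py. It also assumes tiles are numbered with axis 0 varying fastest, and that a trailing partial tile is allowed when the array shape is not a multiple of the tile shape.

```python
import itertools


def tile_grid(array_shape, tile_shape):
    """Number of tiles along each axis, rounding up for a partial last tile."""
    return tuple((n + t - 1) // t for n, t in zip(array_shape, tile_shape))


def make_tiles(array_shape, tile_shape):
    """List of (upper_left, lower_right) corners, axis 0 varying fastest."""
    grid = tile_grid(array_shape, tile_shape)
    tiles = []
    # product() varies its last factor fastest, so iterate the axes reversed.
    for rev_coord in itertools.product(*(range(g) for g in reversed(grid))):
        coord = rev_coord[::-1]
        ul = tuple(c * t for c, t in zip(coord, tile_shape))
        lr = tuple(min(u + t, n) for u, t, n in zip(ul, tile_shape, array_shape))
        tiles.append((ul, lr))
    return tiles


def scan_for_tile(pos, tiles):
    """Baseline: O(number of tiles) scan for the tile containing pos."""
    for i, (ul, lr) in enumerate(tiles):
        if all(u <= p < l for p, u, l in zip(pos, ul, lr)):
            return i
    raise IndexError("position %r is outside the array" % (pos,))


def pos_to_tile(pos, array_shape, tile_shape):
    """Regular tiling: O(number of dimensions) arithmetic lookup."""
    grid = tile_grid(array_shape, tile_shape)
    index, stride = 0, 1
    for p, t, g in zip(pos, tile_shape, grid):
        index += (p // t) * stride
        stride *= g
    return index
```

For a 10x6 array with 4x3 tiles, both `scan_for_tile(pos, make_tiles((10, 6), (4, 3)))` and `pos_to_tile(pos, (10, 6), (4, 3))` should agree for every valid `pos`; only the second avoids touching the tile list.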

Run profiles to find bottlenecks for arrays with many tiles
Migrate extent.py to Cython
Special handling for regular tile shapes

fegin · 2014-02-10T14:27:47Z

The following table shows the share of time each benchmark spends in extent.py.

Benchmark	Master	Workers
large number of tiles	12%	6~10%
reshape	9%	8~10%
transpose	3%	3~10%
benchmark_pagerank	7%	2%
benchmark_lreg	1%	0%
benchmark_finance	12%	1%
benchmark_slice	10%	6~10%
benchmark_dot	1%	0%
benchmark_kmean	4%	0%

After migrating extent.py to Cython and replacing tuples with C arrays, these benchmarks spend less than 1% of their time in extent.py (not committed yet).
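
For anyone who wants to reproduce this kind of measurement, the sketch below shows one possible way to get a per-file time share with the standard library's cProfile and pstats. The thread does not say how the numbers above were collected, so treat this as an assumption. `time_share` and `profile_share` are hypothetical helper names, and the share counts only self time (time inside functions defined in the matching file, excluding their callees).

```python
import cProfile
import pstats


def time_share(profiler, filename_suffix):
    """Fraction of total self time spent in files ending with filename_suffix."""
    # stats maps (filename, lineno, funcname) -> (cc, nc, tottime, cumtime, callers)
    stats = pstats.Stats(profiler).stats
    total = sum(entry[2] for entry in stats.values())
    matched = sum(
        entry[2]
        for (fname, _lineno, _func), entry in stats.items()
        if fname.endswith(filename_suffix)
    )
    return matched / total if total else 0.0


def profile_share(func, filename_suffix):
    """Run func under the profiler and report the share for one file."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    return time_share(profiler, filename_suffix)
```

Calling `profile_share(run_benchmark, "extent.py")`, where `run_benchmark` stands in for whichever benchmark is being measured, would return a value between 0 and 1 that can be compared before and after the Cython change.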
