How to do paging with scan-parallel? #17

Open
ulsa opened this issue Dec 12, 2013 · 15 comments

Comments

@ulsa

ulsa commented Dec 12, 2013

I'm trying to do paging with scan-parallel using :limit, but I'm not sure how to specify :last-prim-kvs in subsequent calls. Each segment needs its own last key, I presume.

Am I missing something, or is paging not implemented for parallel scan?

@ptaoussanis
Member

Hi Ulrik, sorry for the delay responding to this.

Just to clarify: have you used :limit successfully with scan and are having trouble getting it to work with scan-parallel specifically, or are you unsure how to use :limit in general and happen to be using scan-parallel?

I haven't tested it (don't have any db creds with me atm) - but I can't think of a reason why :limit shouldn't work with scan-parallel. It's just a thin scan wrapper to help handle the segment args automatically:

(defn scan-parallel
  "Like `scan` but starts a number of worker threads and automatically handles
  parallel scan options (:total-segments and :segment). Returns a vector of
  `scan` results.

  Ref. http://goo.gl/KLwnn (official parallel scan documentation)."
  [creds table total-segments & [opts]]
  (let [opts (assoc opts :total-segments total-segments)]
    (->> (mapv (fn [seg] (future (scan creds table (assoc opts :segment seg))))
               (range total-segments))
         (mapv deref))))

As for using :limit + :last-prim-kvs: any time you see prim-kvs in a docstring/arg-name it means an argument of form {<hash-key> <val>} or {<hash-key> <val> <range-key> <val>} - i.e. the same form used by get-item, etc.

So to implement paging you'd want to do something like this [untested, don't have a db with me]:

(scan creds :my-table {:limit 2 :attr-conds {:age [:in [24 27]]}})
=> [{:age 24, :name "Steve"} {:age 27, :name "Susan"}]
(scan creds :my-table {:last-prim-kvs {:age 27 :name "Susan"} :attr-conds {:age [:in [24 27]]}})

Does that help?

@ulsa
Author

ulsa commented Dec 15, 2013

I am using :limit and paging successfully with scan, but I can't understand how to do it with scan-parallel. Or is it perhaps so that paging is not possible with scan-parallel, because the order is not predictable or something?

I have around a million entries that I want to process, and I don't want to read them all into memory at once. I'm currently using scan with :limit and paging, processing a batch at a time. However, I have trouble reaching the provisioned limits using scan, so I figured I could use scan-parallel. But perhaps it's not designed to support paging.

@ptaoussanis
Member

> I am using :limit and paging successfully with scan, but I can't understand how to do it with scan-parallel

Sorry I don't have any test dbs on hand atm - it'd help if you could be a little more specific. Are you seeing an error when you replace scan with scan-parallel as in the example I provided above?

> Or is it perhaps so that paging is not possible with scan-parallel

It should be possible. Unless I'm misunderstanding something about what you're trying to do - it should literally be as simple as replacing scan with scan-parallel in your call. No args need to change. Nothing about your methodology needs to change. It should work as a drop-in replacement. What happens when you do that?

@ulsa
Author

ulsa commented Dec 17, 2013

I didn't want to provide lots of details if I was completely misunderstanding the functionality of scan-parallel, but if you say that paging should work, then let's press on. I'll give you details soon. Meanwhile, consider this:

scan-parallel just passes the given opts on to each underlying scan, with the corresponding :segment number added on, right? If I want to send :last-prim-kvs as opts, like I did when I was doing paging with just scan, then how should I specify those? Each segment needs its own starting point, but as far as I can understand, I can only specify a single :last-prim-kvs. Which segment does that go to? The first? What about the other segments? It just doesn't make sense to me.
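
To make the question concrete, here is a small sketch of the opts each segment would receive. It mirrors the two `assoc` steps in the scan-parallel source quoted earlier in the thread; `segment-opts` is a hypothetical helper written for illustration, not part of Faraday.

```clojure
(defn segment-opts
  "Hypothetical helper: returns the opts map that each segment's `scan` call
  would get, following the same `assoc` steps as the quoted `scan-parallel`."
  [total-segments opts]
  (let [opts (assoc opts :total-segments total-segments)]
    (mapv (fn [seg] (assoc opts :segment seg))
          (range total-segments))))

;; A single :last-prim-kvs is copied unchanged into every segment's opts.
(segment-opts 2 {:limit 1 :last-prim-kvs {:id 5}})
;; => [{:limit 1, :last-prim-kvs {:id 5}, :total-segments 2, :segment 0}
;;     {:limit 1, :last-prim-kvs {:id 5}, :total-segments 2, :segment 1}]
```

So the same start key reaches all segments, which is the situation the question is about.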

@ulsa
Author

ulsa commented Dec 17, 2013

I am using :limit and paging successfully with scan:

(scan creds :my-table {:limit 1})

This will give me a vector containing the first page of entries (the actual number depends, in my case 5):

[{:id 1, :x "a"} {:id 2, :x "b"} {:id 3, :x "c"} {:id 4, :x "d"} {:id 5, :x "e"}]

In the next call, I set :last-prim-kvs to {:id 5}, to indicate that I want the scan to start after that id:

(scan creds :my-table {:last-prim-kvs {:id 5} :limit 1})

This will give me the next page of entries:

[{:id 6, :x "f"} {:id 7, :x "g"} {:id 8, :x "h"} {:id 9, :x ""} {:id 10, :x "i"}]

I can't understand how to do it with scan-parallel. The first call is obvious, though. I'm requesting two segments:

(scan-parallel creds :my-table 2 {:limit 1})

This will give me a vector of size 2, where each element is a vector containing some page of entries, not necessarily page one and two:

[
 [{:id 1, :x "a"} {:id 2, :x "b"} {:id 3, :x "c"} {:id 4, :x "d"} {:id 5, :x "e"}]
 [{:id 16, :x "s"} {:id 17, :x "r"} {:id 18, :x "k"} {:id 19, :x "q"} {:id 20, :x "p"}]
]

What about the subsequent calls for the remaining pages? How do I specify :last-prim-kvs? If I do it like with scan, I get this error:

user=> (scan-parallel creds :my-table 2 {:last-prim-kvs {:id "5"} :limit 1})
AmazonServiceException The provided starting key is invalid: Invalid ExclusiveStartKey. 
Please use ExclusiveStartKey with correct Segment. TotalSegments: 2 Segment: 1  
com.amazonaws.http.AmazonHttpClient.handleErrorResponse (AmazonHttpClient.java:679)

You're saying that scan-parallel handles the segment args automatically, but the :last-prim-kvs will be different for each segment. If I could pass a vector of :last-prim-kvs maps, scan-parallel could deduce which map goes to which segment, but I don't seem to be able to pass a vector. Besides, the pages don't seem to be deterministic, so I fear that scan-parallel cannot be used with paging.
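
The vector idea could look roughly like the sketch below. It is not an existing Faraday function: `scan-segments` and its `start-keys` argument are hypothetical, and `scan-fn` stands for any function called as `(scan-fn creds table opts)`, which is the shape `scan` has in the examples above.

```clojure
(defn scan-segments
  "Sketch: like the quoted `scan-parallel`, but takes one start key per segment.
  `start-keys` is a vector holding one :last-prim-kvs map (or nil) per segment;
  its length is used as :total-segments. Returns a vector of scan results in
  segment order."
  [scan-fn creds table start-keys & [opts]]
  (let [opts (assoc opts :total-segments (count start-keys))]
    (->> start-keys
         (map-indexed
           (fn [seg start-key]
             (future
               (scan-fn creds table
                        (cond-> (assoc opts :segment seg)
                          ;; only send a start key to the segment it came from
                          start-key (assoc :last-prim-kvs start-key))))))
         vec           ; start every future before blocking on any of them
         (mapv deref))))

(comment
  ;; First page of each of two segments, then the next page of each,
  ;; assuming :id is the table's hash key as in the examples above.
  (scan-segments scan creds :my-table [nil nil] {:limit 1})
  (scan-segments scan creds :my-table [{:id 5} {:id 20}] {:limit 1}))
```

One caveat with this shape: a nil start key means "start from the beginning", so a caller would have to track separately which segments are already exhausted rather than passing nil for them.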

@ptaoussanis
Member

Hi, closing this - assuming it's gone stale?

@ulsa
Author

ulsa commented Aug 27, 2014

I couldn't get it to work, but I still don't know if I did something wrong or if there is something missing in faraday.

@ptaoussanis
Member

Yeah, sorry - I'm actually not using DynamoDB myself at the moment. Not sure off hand, and don't have any test dbs handy to look into this quickly. Would need to spend some proper time to dig into the DDB docs + API to confirm: may be a DDB limitation, or a Faraday limitation that needs fixing.

Will reopen in case I do find some time in future, or someone else has some input.

Really sorry to leave you hanging on this, wasn't intentional.

@ptaoussanis ptaoussanis reopened this Aug 27, 2014
@ptaoussanis
Member

Quick Google yielded this: http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Scan.html

> In a parallel scan, a Scan request that includes ExclusiveStartKey must specify the same segment whose previous Scan returned the corresponding value of LastEvaluatedKey.

So it seems like parallel scans should be pageable, but Faraday's scan implementation would need some work to allow this to be automatic. Have made a TODO note in the code though realistically don't think I'll personally have time to look into this near-term.

You may be able to use scan directly and feed it the necessary parallel segment info; not sure how tricky that'd be to do.
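
A minimal, untested sketch of that approach: one worker per segment, each paging through its own segment and feeding the last key of a page back into the same segment, as the AWS note above requires. The function names, `key-fn` and `handle-page!` are hypothetical; `scan-fn` is assumed to be called as `(scan-fn creds table opts)` like `scan` in the examples above. It derives the next start key from the last item of each page and stops at the first empty page, which is only safe when no filter is applied; the LastEvaluatedKey in the response would be the more reliable signal.

```clojure
(defn scan-segment-pages
  "Sketch: pages through a single segment, calling (handle-page! items) once
  per non-empty page. `key-fn` extracts the primary-key map from an item,
  e.g. #(select-keys % [:id])."
  [scan-fn creds table key-fn handle-page! {:keys [segment total-segments limit]}]
  (loop [start-key nil]
    (let [opts  (cond-> {:segment segment
                         :total-segments total-segments
                         :limit limit}
                  start-key (assoc :last-prim-kvs start-key))
          items (scan-fn creds table opts)]
      (when (seq items)
        (handle-page! items)
        ;; the next request stays on the same segment as this start key
        (recur (key-fn (last items)))))))

(defn scan-all-segments
  "Sketch: runs one paging worker per segment on its own thread and blocks
  until all of them finish. `handle-page!` must be safe to call concurrently."
  [scan-fn creds table key-fn handle-page! {:keys [total-segments limit]}]
  (->> (range total-segments)
       (mapv (fn [seg]
               (future
                 (scan-segment-pages scan-fn creds table key-fn handle-page!
                                     {:segment seg
                                      :total-segments total-segments
                                      :limit limit}))))
       (run! deref)))

(comment
  ;; Process the table in batches of up to 100 items across 4 segments,
  ;; assuming :id is the hash key; `process-batch!` is a placeholder.
  (scan-all-segments scan creds :my-table #(select-keys % [:id])
                     process-batch! {:total-segments 4 :limit 100}))
```

Only one page per segment is held in memory at a time, which fits the original goal of processing a large table in batches.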

PRs super welcome if you (or anyone else) feels like taking a stab at this!

Cheers :-)

@barkanido

Is this something worth fixing for a Faraday noob? Are you open to reviewing a PR on this? Any thoughts about a reasonable solution direction?

@kipz
Collaborator

kipz commented Jun 26, 2020

I think it would be nice to get this fixed @barkanido given we have the beginning of an implementation, so I expect PRs would be welcomed by the community.

Having said that however, you could probably just manage the threads and paging yourself and just call scan directly, and personally, that's the approach I'd prefer here.

I have an implementation of a lazy-paged-query that will manage query paging automatically, and I'm glad that it was easy to build on top of Faraday, but don't feel that it should be part of the API. Handling thread-pools and paging of large result-sets of scans fits into the same category for me. Of course I don't speak for the community here at all though, it's just my opinion!
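
As a rough illustration of the general idea of lazy paging built on top of a paged call (this is not the lazy-paged-query implementation mentioned above, which is not shown in this thread), a lazy sequence of pages can be sketched as follows. `lazy-pages` and `fetch-page` are hypothetical names; `fetch-page` is assumed to take a start key (nil for the first page) and return a map with `:items` and `:next-key`.

```clojure
(defn lazy-pages
  "Sketch: returns a lazy seq of pages. `fetch-page` is called with nil for the
  first page and with the previous page's :next-key afterwards. The seq ends
  at the first empty page or when :next-key is nil."
  [fetch-page]
  (letfn [(step [start-key]
            (lazy-seq
              (let [{:keys [items next-key]} (fetch-page start-key)]
                (when (seq items)
                  (cons items
                        (when next-key (step next-key)))))))]
    (step nil)))

(comment
  ;; Wrapping a plain `scan`, assuming :id is the hash key and deriving the
  ;; next start key from the last item of each page.
  (lazy-pages
    (fn [start-key]
      (let [items (scan creds :my-table
                        (cond-> {:limit 100}
                          start-key (assoc :last-prim-kvs start-key)))]
        {:items    items
         :next-key (some-> (last items) (select-keys [:id]))}))))
```

Each page is fetched only when the consumer reaches it, so the caller decides how much is in memory at once.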

@barkanido

@kipz fair enough. Maybe your example deserves a place here as an example people can refer to, or even in the README. Just a thought. Anyway, I was just looking for a way to contribute and found this issue. Maybe a task from the TODO is of higher priority?

@kipz
Collaborator

kipz commented Jun 26, 2020

@joelittlejohn what are your thoughts on all this?

@joelittlejohn
Collaborator

@barkanido Re your question about whether this ticket is a good one for a Faraday noob to tackle, it's probably not 🙂 The existing paging implementation is one of the most complex parts of Faraday and as @kipz mentions people have often found that they prefer to avoid the paging feature altogether and implement their own solution (over which they have more control) outside Faraday.

Is this a feature you need or were you just interested in making a useful contribution? I think the most useful thing to be done for Faraday is better documentation. Better docstrings would help, and I think it would be very useful to have a list of examples that show real-world usage covering all typical ways to use Faraday's functions.

@joelittlejohn
Collaborator

For a Faraday noob who wants to contribute something useful, I recommend using the library for a while on a few projects; over time you will inevitably uncover something you'd like that is missing.
