
Polyglot database support for relationships? #36

Closed · budjb opened this issue Oct 31, 2018 · 8 comments

budjb commented Oct 31, 2018

Admittedly, this isn't so much an issue as it is a request for guidance. It might turn into a feature request.

I'm working on a graphql schema that will source information from a Mongo database and a neo4j database. The documents I'm storing in Mongo are rather large (up to 2MB in some cases), and those documents are highly related. It was easiest to provide traversal between objects via a graph database, and we're not terribly concerned about most of the schema of the documents in Mongo, so the tech makes sense for our use case.

The dataloader library is a big help in optimizing our db queries to Mongo to load documents, but I'm currently at a loss as to how to use neo4j to query for which documents I need to load. The general order of operations looks something like this for our data:

  1. Load first-level documents from Mongo.
  2. For a second-level portion of a query, collect the entire set of IDs from the results of the first-level query, make a single query to neo4j for all related IDs for the relationship type, and map those results back to the source IDs, resulting roughly in a mapping of source ID to a list of related IDs (see the sketch after this list).
  3. Load second-level documents from Mongo, with the IDs retrieved from the previous step.
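
Roughly, the intermediate structure coming out of step 2 would look like this (the ids and variable names here are made up purely for illustration):

Map<String, List<String>> relatedIdsBySource = new HashMap<>();
relatedIdsBySource.put("source-1", Arrays.asList("related-7", "related-9"));
relatedIdsBySource.put("source-2", Arrays.asList("related-7"));
// step 3 then loads the union of all related ids from Mongo in one query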

The dataloader library makes the process easy if you can derive the IDs of related documents from a source object, but in my case I'm not able to determine that without a neo4j query. The best I know to do at the moment is to run a neo4j query per source object in a data fetcher, meaning I still end up with roughly an n+1 problem.

Any guidance on how to optimize for bulk loading in this scenario?

budjb commented Oct 31, 2018

TLDR: it would be awesome if a dataloader supported not just a list of IDs to load, but a map of lists of IDs to load keyed by their source ID.

budjb commented Nov 1, 2018

TLDR: it would be awesome if a dataloader supported bulk querying where only the source ID is known, and the batch loader could use the list of source IDs to query for the target objects.

bbakerman (Member) commented

> The dataloader library makes the process easy if you can derive the IDs of related documents from a source object, but in my case I'm not able to determine that without a neo4j query. The best I know how to do this at the moment is run a neo4j query for an individual source object in a datafetcher for each source object, meaning I still end up roughly with an n+1 problem.

If you have N source IDs, then in order to have efficient batch reading you MUST be able to perform one query that takes a list of ids.

In your case your neo4j query needs to take N ids as input. I don't know neo4j syntax, but maybe something like:

 MATCH (node1:Label1)-->(node2:Label2)
 WHERE node1.propertyA IN [value1, value2, value3, valueN]
 RETURN node2.propertyA, node2.propertyB

I am not sure if neo4j can handle that, but think of it like the IN operator in SQL.

Otherwise as you say for N source objects you will get 1+N neo4j queries.

graphql + java-dataloader can only "resolve" fields at specific levels, since in graphql the objects from level 1 feed the values in level 2. So if you have N objects at level 1, then you MUST be able to batch load N level-2 values. If you can't, then it's not batch loading.

I am sorry I don't know more about neo4j syntax for this.

bbakerman (Member) commented

> TLDR: it would be awesome if a dataloader supported bulk querying where only the source ID is known, and the batch loader could use the list of source IDs to query for the target objects.

This is exactly what dataloader does. It takes a list of source IDs and asks that you provide a batch loader function that returns the same-sized list of target objects.
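
In code terms, the contract is a list of keys in and a same-sized list of values out. A minimal sketch against the BatchLoader interface (TargetObject and loadTargetsInKeyOrder are made up for illustration):

import org.dataloader.BatchLoader;
import java.util.concurrent.CompletableFuture;

BatchLoader<String, TargetObject> batchLoader = sourceIds ->
    // must return one TargetObject per incoming id, in the same order
    CompletableFuture.supplyAsync(() -> loadTargetsInKeyOrder(sourceIds));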

If this is not what you meant by that statement, can you explain more, please?

bbakerman (Member) commented

I was looking more at your explanation; let me see if I can guess what you mean.

> For a second-level portion of a query, collect the entire set of IDs from the results of the first level query, make a single query to neo4j for a list of all related IDs for the relationship type, and map those results to the source IDs (resulting roughly in a mapping of source ID to a list of related IDs).
> Load second-level documents from Mongo, with the IDs retrieved from the previous step.

I think you need a BatchLoader that does 2 steps:

  1. It takes all the source ids and feeds them into a neo4j query that itself can take N ids as input.
  2. It gets back the list of relationships and then does a MongoDB query that itself can take the N ids from the previous step.

If your neo4j / MongoDB queries can't take multiple ids as input, then indeed you can't batch.
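
A rough sketch of such a 2-step batch loader (neo4jDao / mongoDao and their methods are made up for illustration; Document stands in for whatever your Mongo driver returns):

import org.dataloader.BatchLoader;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

BatchLoader<String, List<Document>> relatedDocsLoader = sourceIds ->
    CompletableFuture.supplyAsync(() -> {
        // step 1: ONE neo4j query for all source ids,
        // yielding source id -> list of related document ids
        Map<String, List<String>> relatedIdsBySource = neo4jDao.relatedIds(sourceIds);

        // step 2: ONE Mongo query for the union of all related ids
        Set<String> allRelatedIds = relatedIdsBySource.values().stream()
                .flatMap(List::stream)
                .collect(Collectors.toSet());
        Map<String, Document> docsById = mongoDao.findByIds(allRelatedIds);

        // reassemble one list of documents per source id,
        // in the same order the source ids arrived
        return sourceIds.stream()
                .map(id -> relatedIdsBySource.getOrDefault(id, Collections.emptyList())
                        .stream()
                        .map(docsById::get)
                        .collect(Collectors.toList()))
                .collect(Collectors.toList());
    });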

budjb commented Nov 6, 2018

I'm sorry my question wasn't clear. I actually know the queries I need to write. Cypher (neo4j's query language) does support a SQL-esque IN operation, so I'm good on that front.

I'm actually unclear as to how to use the dataloader to provide the information I need. From what I understand, you call a dataloader's load or loadMany method with the IDs of the objects you want to load (the target objects). How do I supply my batch loader with the ID of the source object to load from, when I do not yet know the IDs of the target objects to load?

budjb commented Nov 7, 2018

I guess this question really boils down to: is it OK to pass not the ID of the sub-object to load, but the ID of the source object to load sub-objects for? I'm not sure what happens under the covers, so I can't assume this works correctly with regards to caching, etc. That is the answer I'm looking for, I suppose.

bbakerman (Member) commented

Can you please lay out a graphql query so I can explain it in terms of that?

I will make one up and use it, but if you show your query that would help.

Imagine you have orders and order items and this query:

query {
    orders(criteria : "xxx") {  # list of 10 orders
        id
        orderitem {             # N order items per order
            name
            price
            relatedItems {      # M related items per order item
                name
                price
            }
        }
    }
}

You would have a normal loader behind the orders field since it returns a list of order source objects. Nothing special there.

You would have a dataloader batch function behind the orderitem field that loads all order items for a given source order id. If you had 10 orders, the batch function would be called with 10 source order ids and you would be expected to return 10 order item lists back. Something like:

BatchLoader<String, List<OrderItem>> bl = keysOfOrderIds ->
    CompletableFuture.supplyAsync(() ->
        serviceLayer.batchLoadOrderListsForSourceIds(keysOfOrderIds));

Notice how the loader returns a list of lists of order items, e.g. List<List<OrderItem>>. It's your job to take the set of order ids (source objects), do one query, and break the result up into a list of lists, one per order id.

In SQL this might be "select * from orderitems where orderid in (:keysOfOrderIds) order by orderId".

You would then break that flat list of records into a list of lists of OrderItems by traversing the result set and creating a new list whenever the orderId changes.
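
Continuing inside that batch function, a sketch of the regrouping (OrderItemRow and its accessors are made up; rows is the flat result set from the query above):

Map<String, List<OrderItem>> itemsByOrderId = new LinkedHashMap<>();
for (OrderItemRow row : rows) {
    itemsByOrderId.computeIfAbsent(row.getOrderId(), k -> new ArrayList<>())
            .add(row.toOrderItem());
}
// one list per incoming key, in the SAME order as keysOfOrderIds;
// keys with no rows get an empty list so the sizes still match
List<List<OrderItem>> result = keysOfOrderIds.stream()
        .map(id -> itemsByOrderId.getOrDefault(id, Collections.emptyList()))
        .collect(Collectors.toList());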

And then the same for the "relatedItems" field. This would be another dataloader, but again it will be called with N order item keys and it needs to return N values. If those N values are lists, then you need to break them up on "source key" boundaries.

If you don't do this breaking down of a single query into per-key results, then you will indeed run into an N+1 problem by issuing 1 call per key to get the list of targets.

In terms of caching, the key is ALWAYS the cache key. If the dataloader has seen keyX before and caching is turned on, it will return the previous values for keyX, and keyX will NOT be presented to the batch loader.
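
For example, a sketch assuming the DataLoader.newDataLoader factory with caching left at its default (on):

DataLoader<String, List<OrderItem>> loader = DataLoader.newDataLoader(bl);

loader.load("order-1"); // first sighting: queued for the batch loader
loader.load("order-1"); // cache hit: the same future is returned, NOT queued again
loader.load("order-2"); // queued

loader.dispatch();      // the batch loader is called with ["order-1", "order-2"]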

I hope this helps.

budjb closed this as completed Mar 8, 2019