
Polyglot database support for relationships? #36

Closed · budjb opened this issue Oct 31, 2018 · 8 comments

budjb commented Oct 31, 2018

Admittedly, this isn't so much an issue as it is a request for guidance. It might turn into a feature request.

I'm working on a graphql schema that will source information from a Mongo database and a neo4j database. The documents I'm storing in Mongo are rather large (up to 2MB in some cases), and those documents are highly related. It was easiest to provide traversal between objects via a graph database, and we're not terribly concerned about most of the schema of the documents in Mongo, so the tech makes sense for our use case.

The dataloader library is a big help in optimizing our db queries to Mongo to load documents, but I'm currently at a loss as to how to use neo4j to query for which documents I need to load. The general order of operations looks something like this for our data:

  1. Load first-level documents from Mongo.
  2. For a second-level portion of a query, collect the entire set of IDs from the results of the first-level query, make a single query to neo4j for all related IDs for the relationship type, and map those results back to the source IDs, resulting roughly in a mapping of source ID to a list of related IDs (see the sketch after this list).
  3. Load second-level documents from Mongo, with the IDs retrieved from the previous step.
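
Roughly, the intermediate structure coming out of step 2 would look like this (the ids and variable names here are made up purely for illustration):

Map<String, List<String>> relatedIdsBySource = new HashMap<>();
relatedIdsBySource.put("source-1", Arrays.asList("related-7", "related-9"));
relatedIdsBySource.put("source-2", Arrays.asList("related-7"));
// step 3 then loads the union of all related ids from Mongo in one query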

The dataloader library makes the process easy if you can derive the IDs of related documents from a source object, but in my case I'm not able to determine that without a neo4j query. The best I know to do at the moment is to run a neo4j query per source object in a data fetcher, meaning I still end up with roughly an n+1 problem.

Any guidance on how to optimize for bulk loading in this scenario?

budjb commented Oct 31, 2018

TLDR: it would be awesome if a dataloader supported not just a list of IDs to load, but a map of lists of IDs to load keyed by their source ID.

budjb commented Nov 1, 2018

TLDR: it would be awesome if a dataloader supported bulk querying where only the source ID is known, and the batch loader could use the list of source IDs to query for the target objects.

bbakerman (Member) commented

> The dataloader library makes the process easy if you can derive the IDs of related documents from a source object, but in my case I'm not able to determine that without a neo4j query. The best I know how to do this at the moment is run a neo4j query for an individual source object in a datafetcher for each source object, meaning I still end up roughly with an n+1 problem.

If you have N source IDs, then in order to have efficient batch reading you MUST be able to perform one query that takes a list of ids.

In your case your neo4j query needs to take N ids as input. I don't know neo4j syntax, but maybe something like:

 MATCH (node1:Label1)-->(node2:Label2)
 WHERE node1.propertyA IN [value1, value2, value3, valueN]
 RETURN node2.propertyA, node2.propertyB

I am not sure if neo4j can handle that, but think of it like the IN operator in SQL.

Otherwise as you say for N source objects you will get 1+N neo4j queries.

graphql + java-dataloader can only "resolve" fields at specific levels, since in graphql the objects from level 1 feed the values in level 2. So if you have N objects at level 1, then you MUST be able to batch load N level-2 values. If you can't, then it's not batch loading.

I am sorry I don't know more about neo4j syntax for this.

bbakerman (Member) commented

> TLDR: it would be awesome if a dataloader supported bulk querying where only the source ID is known, and the batch loader could use the list of source IDs to query for the target objects.

This is exactly what dataloader does. It takes a list of source IDs and asks that you provide a batch loader function that returns the same-sized list of target objects.
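
In code terms, the contract is a list of keys in and a same-sized list of values out. A minimal sketch against the BatchLoader interface (TargetObject and loadTargetsInKeyOrder are made up for illustration):

import org.dataloader.BatchLoader;
import java.util.concurrent.CompletableFuture;

BatchLoader<String, TargetObject> batchLoader = sourceIds ->
    // must return one TargetObject per incoming id, in the same order
    CompletableFuture.supplyAsync(() -> loadTargetsInKeyOrder(sourceIds));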

If this is not what you meant by that statement, can you explain more, please?

bbakerman (Member) commented

I was looking more at your explanation; let me see if I can guess what you mean.

> For a second-level portion of a query, collect the entire set of IDs from the results of the first level query, make a single query to neo4j for a list of all related IDs for the relationship type, and map those results to the source IDs (resulting roughly in a mapping of source ID to a list of related IDs).
> Load second-level documents from Mongo, with the IDs retrieved from the previous step.

I think you need a BatchLoader that does 2 steps:

  1. It takes all the source ids and feeds them into a neo4j query that itself can take N ids as input.
  2. It gets back the list of relationships and then does a MongoDB query that itself can take the N ids from the previous step.

If your neo4j / MongoDB queries can't take multiple ids as input, then indeed you can't batch.
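
A rough sketch of such a 2-step batch loader (neo4jDao / mongoDao and their methods are made up for illustration; Document stands in for whatever your Mongo driver returns):

import org.dataloader.BatchLoader;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

BatchLoader<String, List<Document>> relatedDocsLoader = sourceIds ->
    CompletableFuture.supplyAsync(() -> {
        // step 1: ONE neo4j query for all source ids,
        // yielding source id -> list of related document ids
        Map<String, List<String>> relatedIdsBySource = neo4jDao.relatedIds(sourceIds);

        // step 2: ONE Mongo query for the union of all related ids
        Set<String> allRelatedIds = relatedIdsBySource.values().stream()
                .flatMap(List::stream)
                .collect(Collectors.toSet());
        Map<String, Document> docsById = mongoDao.findByIds(allRelatedIds);

        // reassemble one list of documents per source id,
        // in the same order the source ids arrived
        return sourceIds.stream()
                .map(id -> relatedIdsBySource.getOrDefault(id, Collections.emptyList())
                        .stream()
                        .map(docsById::get)
                        .collect(Collectors.toList()))
                .collect(Collectors.toList());
    });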

budjb commented Nov 6, 2018

I'm sorry my question wasn't clear. I actually know the queries I need to write. Cypher (neo4j's query language) does support a SQL-esque IN operation, so I'm good on that front.

I'm actually unclear as to how to use the dataloader to provide the information I need. From what I understand, you call a dataloader's load or loadMany method with the IDs of the objects you want to load (the target objects). How do I supply my batch loader with the ID of the source object to load from, when I do not yet know the IDs of the target objects to load?

budjb commented Nov 7, 2018

I guess this question really boils down to: is it OK to pass not the ID of the sub-object to load, but the ID of the source object to load sub-objects for? I'm not sure what happens under the covers, so I can't assume this works correctly with regards to caching, etc. That is the answer I'm looking for, I suppose.

bbakerman (Member) commented

Can you please lay out a graphql query so I can explain it in terms of that?

I will make one up and use it, but if you show your query that would help.

Imagine you have orders and order items and this query:

query {
    orders(criteria : "xxx") {  # list of 10 orders
        id
        orderitem {             # N order items per order
            name
            price
            relatedItems {      # M related items per order item
                name
                price
            }
        }
    }
}

You would have a normal loader behind the orders field since it returns a list of order source objects. Nothing special there.

You would have a dataloader batch function behind the orderitem field that loads all order items for a given source order id. If you had 10 orders, the batch function would be called with 10 source order ids and you would be expected to return 10 order item lists back. Something like:

BatchLoader<String, List<OrderItem>> bl = keysOfOrderIds ->
    CompletableFuture.supplyAsync(() ->
        serviceLayer.batchLoadOrderListsForSourceIds(keysOfOrderIds));

Notice how the loader returns a list of lists of order items, e.g. List<List<OrderItem>>. It's your job to take the set of order ids (source objects), do one query, and break the result up into a list of lists, one per order id.

In SQL this might be "select * from orderitems where orderid in (:keysOfOrderIds) order by orderId".

You would then break that flat list of records into a list of lists of OrderItems by traversing the result set and creating a new list whenever the orderId changes.
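
Continuing inside that batch function, a sketch of the regrouping (OrderItemRow and its accessors are made up; rows is the flat result set from the query above):

Map<String, List<OrderItem>> itemsByOrderId = new LinkedHashMap<>();
for (OrderItemRow row : rows) {
    itemsByOrderId.computeIfAbsent(row.getOrderId(), k -> new ArrayList<>())
            .add(row.toOrderItem());
}
// one list per incoming key, in the SAME order as keysOfOrderIds;
// keys with no rows get an empty list so the sizes still match
List<List<OrderItem>> result = keysOfOrderIds.stream()
        .map(id -> itemsByOrderId.getOrDefault(id, Collections.emptyList()))
        .collect(Collectors.toList());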

And then the same for the "relatedItems" field. This would be another dataloader, but again it will be called with N order item keys and it needs to return N values. If those N values are lists, then you need to break them up on "source key" boundaries.

If you don't do this breaking down of a single query into per-key results, then you will indeed run into an N+1 problem by issuing 1 call per key to get the list of targets.

In terms of caching, the key is ALWAYS the cache key. If the dataloader has seen keyX before and caching is turned on, it will return the previous values for keyX, and keyX will NOT be presented to the batch loader.
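
For example, a sketch assuming the DataLoader.newDataLoader factory with caching left at its default (on):

DataLoader<String, List<OrderItem>> loader = DataLoader.newDataLoader(bl);

loader.load("order-1"); // first sighting: queued for the batch loader
loader.load("order-1"); // cache hit: the same future is returned, NOT queued again
loader.load("order-2"); // queued

loader.dispatch();      // the batch loader is called with ["order-1", "order-2"]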

I hope this helps.

budjb closed this as completed Mar 8, 2019