Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(gatsby-core-utils): make createContentDigest deterministic #19832

Merged
merged 6 commits into from
Nov 29, 2019

Conversation

wardpeet
Copy link
Contributor

@wardpeet wardpeet commented Nov 27, 2019

Description

Make createContentDigest deterministic. Currently, we use JSON.stringify to convert an object to a string which is fine but not great. Whenever the order of the arguments change the hash differs which bust caches when it shouldn't.

example
{ arg1: 'tst', arg2: 'x' } is not equal to {arg2: 'x', arg1: 'tst'}
{ arg1: ['tst', 'tst2'], arg2: 'x' } is not equal to {arg2: 'x', arg1: ['tst2', 'tst']}

The first example should be equal and the second one shouldn't as we can't make sure if arrays are deterministic. Honestly, I feel like it should for our use cases but I didn't want to break anything so I've added an option to make it so.

Jobs (#19831) will be 100% deterministic so I want to sort arrays too.

I've compared hash-object & node-object-hash. It looked like node-object-hash was faster.

Related Issues

#19831

@wardpeet wardpeet requested a review from a team as a code owner November 27, 2019 09:38
@pieh
Copy link
Contributor

pieh commented Nov 27, 2019

The first example should be equal and the second one shouldn't as we can't make sure if arrays are deterministic. Honestly, I feel like it should for our use cases but I didn't want to break anything so I've added an option to make it so.

When would we want to actually sort arrays before generating hash?

I've compared hash-object & node-object-hash. It looked like node-object-hash was faster.

Did you do perf comparisons for createContentDigest ( previous version vs proposed change )?

@wardpeet
Copy link
Contributor Author

wardpeet commented Nov 27, 2019

When would we want to actually sort arrays before generating hash?

Jobs API would favor this as mistakes can easily be made where arguments have different order but the same work will be done. If we look at gatsby-plugin-sharp, the transformations needed on an input file could be in a different order, which doesn't mean the result would be different.

Did you do perf comparisons for createContentDigest ( previous version vs proposed change )?

I haven't yet, I was going to run it against .org and see what the difference was and also do an isolated micro benchmark.

@pieh
Copy link
Contributor

pieh commented Nov 27, 2019

Jobs API would favor this as mistakes can easily be made where arguments have different order but the same work will be done. If we look at gatsby-plugin-sharp, the transformations needed on an input file could be in a different order, which doesn't mean the result would be different.

I feel like this is potentially dangerous: for image transformation order of transformations doesn't matter, but it might matter for other jobs in the future. To me it seems like we should guarantee stable transformations order in sharp plugin instead (so that hash will always be the same, without resorting to createContentDigest sorting arrays)

@wardpeet
Copy link
Contributor Author

wardpeet commented Nov 27, 2019

We can let the plugin authors decide. They could sort it upfront. I'll revert that.

@wardpeet
Copy link
Contributor Author

I did benchmarks on .org and microbenchmarks.

It looks like hash-object is slower, I don't think it is slow enough to not do it. Technically I could only use it for jobs but I feel proper hashing is not a waste.

Summary:

lifecycle create-content-digest node-object-hash
source and transform nodes 120s 121s
bootstrap finished 156s 157s
run queries 274s 282s
gatsby complete 489s 496s
time measure 623s 617s

Microbench:
Doesn't seems like much difference in complexity. It's just a difference in speed of my pc XD

amount create-content-digest node-object-hash
100 nested objects 1424 ops/s 899 ops/s
50 nested objects 2038 ops/s 1333 ops/s
25 nested objects 2379 ops/s 1637 ops/s
10 nested objects 2173 ops/s 1282 ops/

When running strings only so no objects
node-object-hash-v1 x 56,678 ops/sec ±4.65% (75 runs sampled)
createContentDigest x 61,555 ops/sec ±1.78% (88 runs sampled)

So for non-objects I could use the old createContentDigest and for objects go for node-object-hash.

Run 1 - node-hash-object
$ time gatsby build --no-uglify
success open and validate gatsby-configs - 0.165s
success load plugins - 0.851s
success onPreInit - 0.005s
success delete html and css files from previous builds - 0.008s
success initialize cache - 0.007s
success copy gatsby files - 0.016s
success onPreBootstrap - 0.007s
success createSchemaCustomization - 0.153s
success source and transform nodes - 108.856s
warn There are conflicting field types in your data.

If you have explicitly defined a type for those fields, you can safely ignore this warning message.
Otherwise, Gatsby will omit those fields from the GraphQL schema.

If you know all field types in advance, the best strategy is to explicitly define them with the `createTypes` action, and skip inference with the `@dontInfer` directive.
See https://www.gatsbyjs.org/docs/actions/#createTypes
NPMPackage.types.ts:
 - type: boolean
   value: false
 - type: string
   value: 'included'
success building schema - 5.275s
warn code block or inline code language not specified in markdown. applying generic code block
warn unable to find prism language 'sh' for highlighting. applying generic code block
warn unable to find prism language 'console' for highlighting. applying generic code block
warn unable to find prism language ' for highlighting. applying generic code block
warn unable to find prism language '.gitignore' for highlighting. applying generic code block
warn unable to find prism language '.json' for highlighting. applying generic code block
warn unable to find prism language 'grahpql' for highlighting. applying generic code block
warn unable to find prism language 'gatsby-config.js' for highlighting. applying generic code block
warn unable to find prism language 'plain' for highlighting. applying generic code block
success createPages - 27.180s
success createPagesStatefully - 1.001s
success onPreExtractQueries - 0.004s
success update schema - 0.093s
success extract queries from components - 0.791s
success write out requires - 0.037s
success write out redirect data - 0.002s
success Build manifest and related icons - 0.098s
success onPostBootstrap - 0.165s

info bootstrap finished - 146.793 s

success Building production JavaScript and CSS bundles - 26.710s
success Rewriting compilation hashes - 0.003s
success run queries - 282.961s - 3449/3449 12.19/s
success Building static HTML for pages - 32.678s - 3444/3444 105.39/s
info Generated public/sw.js, which will precache 7 files, totaling 1440680 bytes.
The following pages will be precached:
/offline-plugin-app-shell-fallback/index.html
info Done building in 485.215770107 sec
gatsby build --no-uglify  620.22s user 18.88s system 131% cpu 8:05.74 total
Run 2 - node-hash-object
$ time gatsby build --no-uglify
success open and validate gatsby-configs - 0.154s
success load plugins - 0.838s
success onPreInit - 0.005s
success delete html and css files from previous builds - 0.008s
success initialize cache - 0.008s
success copy gatsby files - 0.015s
success onPreBootstrap - 0.012s
success createSchemaCustomization - 0.149s
success source and transform nodes - 128.456s
warn There are conflicting field types in your data.

If you have explicitly defined a type for those fields, you can safely ignore this warning message.
Otherwise, Gatsby will omit those fields from the GraphQL schema.

If you know all field types in advance, the best strategy is to explicitly define them with the `createTypes` action, and skip inference with the `@dontInfer` directive.
See https://www.gatsbyjs.org/docs/actions/#createTypes
NPMPackage.types.ts:
 - type: boolean
   value: false
 - type: string
   value: 'included'
success building schema - 5.113s
warn code block or inline code language not specified in markdown. applying generic code block
warn unable to find prism language 'sh' for highlighting. applying generic code block
warn unable to find prism language 'console' for highlighting. applying generic code block
warn unable to find prism language ' for highlighting. applying generic code block
warn unable to find prism language '.gitignore' for highlighting. applying generic code block
warn unable to find prism language '.json' for highlighting. applying generic code block
warn unable to find prism language 'grahpql' for highlighting. applying generic code block
warn unable to find prism language 'gatsby-config.js' for highlighting. applying generic code block
warn unable to find prism language 'plain' for highlighting. applying generic code block
success createPages - 25.406s
success createPagesStatefully - 0.907s
success onPreExtractQueries - 0.004s
success update schema - 0.096s
success extract queries from components - 0.763s
success write out requires - 0.037s
success write out redirect data - 0.002s
success Build manifest and related icons - 0.108s
success onPostBootstrap - 0.174s

info bootstrap finished - 164.296 s

success Building production JavaScript and CSS bundles - 26.500s
success Rewriting compilation hashes - 0.003s
success run queries - 283.734s - 3449/3449 12.16/s
success Building static HTML for pages - 32.600s - 3444/3444 105.64/s
info Generated public/sw.js, which will precache 7 files, totaling 1440680 bytes.
The following pages will be precached:
/offline-plugin-app-shell-fallback/index.html
info Done building in 503.376172368 sec
gatsby build --no-uglify  617.69s user 18.63s system 126% cpu 8:23.85 total
Run 3 - node-hash-object
$ time gatsby build --no-uglify
success open and validate gatsby-configs - 0.157s
success load plugins - 0.837s
success onPreInit - 0.005s
success delete html and css files from previous builds - 0.008s
success initialize cache - 0.008s
success copy gatsby files - 0.016s
success onPreBootstrap - 0.007s
success createSchemaCustomization - 0.152s
success source and transform nodes - 127.890s
warn There are conflicting field types in your data.

If you have explicitly defined a type for those fields, you can safely ignore this warning message.
Otherwise, Gatsby will omit those fields from the GraphQL schema.

If you know all field types in advance, the best strategy is to explicitly define them with the `createTypes` action, and skip inference with the `@dontInfer` directive.
See https://www.gatsbyjs.org/docs/actions/#createTypes
NPMPackage.types.ts:
 - type: boolean
   value: false
 - type: string
   value: 'included'
success building schema - 4.970s
warn code block or inline code language not specified in markdown. applying generic code block
warn unable to find prism language 'sh' for highlighting. applying generic code block
warn unable to find prism language 'console' for highlighting. applying generic code block
warn unable to find prism language ' for highlighting. applying generic code block
warn unable to find prism language '.gitignore' for highlighting. applying generic code block
warn unable to find prism language '.json' for highlighting. applying generic code block
warn unable to find prism language 'grahpql' for highlighting. applying generic code block
warn unable to find prism language 'gatsby-config.js' for highlighting. applying generic code block
warn unable to find prism language 'plain' for highlighting. applying generic code block
success createPages - 24.542s
success createPagesStatefully - 1.010s
success onPreExtractQueries - 0.004s
success update schema - 0.093s
success extract queries from components - 0.749s
success write out requires - 0.037s
success write out redirect data - 0.002s
success Build manifest and related icons - 0.096s
success onPostBootstrap - 0.161s

info bootstrap finished - 162.822 s

success Building production JavaScript and CSS bundles - 26.153s
success Rewriting compilation hashes - 0.003s
success run queries - 282.775s - 3449/3449 12.20/s
success Building static HTML for pages - 32.417s - 3444/3444 106.24/s
info Generated public/sw.js, which will precache 7 files, totaling 1440680 bytes.
The following pages will be precached:
/offline-plugin-app-shell-fallback/index.html
info Done building in 500.548101177 sec
gatsby build --no-uglify  615.29s user 18.94s system 126% cpu 8:21.04 total
Run 1 - createContentDigest
$ time gatsby build --no-uglify
success open and validate gatsby-configs - 0.154s
success load plugins - 0.886s
success onPreInit - 0.006s
success delete html and css files from previous builds - 0.008s
success initialize cache - 0.007s
success copy gatsby files - 0.016s
success onPreBootstrap - 0.008s
success createSchemaCustomization - 0.158s
success source and transform nodes - 123.055s
warn There are conflicting field types in your data.

If you have explicitly defined a type for those fields, you can safely ignore this warning message.
Otherwise, Gatsby will omit those fields from the GraphQL schema.

If you know all field types in advance, the best strategy is to explicitly define them with the `createTypes` action, and skip inference with the `@dontInfer` directive.
See https://www.gatsbyjs.org/docs/actions/#createTypes
NPMPackage.types.ts:
 - type: boolean
   value: false
 - type: string
   value: 'included'
success building schema - 5.246s
warn code block or inline code language not specified in markdown. applying generic code block
warn unable to find prism language 'sh' for highlighting. applying generic code block
warn unable to find prism language 'console' for highlighting. applying generic code block
warn unable to find prism language ' for highlighting. applying generic code block
warn unable to find prism language '.gitignore' for highlighting. applying generic code block
warn unable to find prism language '.json' for highlighting. applying generic code block
warn unable to find prism language 'grahpql' for highlighting. applying generic code block
warn unable to find prism language 'gatsby-config.js' for highlighting. applying generic code block
warn unable to find prism language 'plain' for highlighting. applying generic code block
success createPages - 25.544s
success createPagesStatefully - 0.237s
success onPreExtractQueries - 0.005s
success update schema - 0.091s
success extract queries from components - 0.748s
success write out requires - 0.036s
success write out redirect data - 0.002s
success Build manifest and related icons - 0.098s
success onPostBootstrap - 0.164s

info bootstrap finished - 158.578 s

success Building production JavaScript and CSS bundles - 27.632s
success Rewriting compilation hashes - 0.003s
success run queries - 273.994s - 3450/3450 12.59/s
success Building static HTML for pages - 32.656s - 3445/3445 105.49/s
info Generated public/sw.js, which will precache 7 files, totaling 1445083 bytes.
The following pages will be precached:
/offline-plugin-app-shell-fallback/index.html
info Done building in 491.666674589 sec
gatsby build --no-uglify  621.83s user 19.22s system 130% cpu 8:12.21 total
Run 2 - createContentDigest
$ time gatsby build --no-uglify
success open and validate gatsby-configs - 0.156s
success load plugins - 0.873s
success onPreInit - 0.006s
success delete html and css files from previous builds - 0.008s
success initialize cache - 0.007s
success copy gatsby files - 0.016s
success onPreBootstrap - 0.008s
success createSchemaCustomization - 0.149s
success source and transform nodes - 123.039s
warn There are conflicting field types in your data.

If you have explicitly defined a type for those fields, you can safely ignore this warning message.
Otherwise, Gatsby will omit those fields from the GraphQL schema.

If you know all field types in advance, the best strategy is to explicitly define them with the `createTypes` action, and skip inference with the `@dontInfer` directive.
See https://www.gatsbyjs.org/docs/actions/#createTypes
NPMPackage.types.ts:
 - type: boolean
   value: false
 - type: string
   value: 'included'
success building schema - 5.270s
warn code block or inline code language not specified in markdown. applying generic code block
warn unable to find prism language 'sh' for highlighting. applying generic code block
warn unable to find prism language 'console' for highlighting. applying generic code block
warn unable to find prism language ' for highlighting. applying generic code block
warn unable to find prism language '.gitignore' for highlighting. applying generic code block
warn unable to find prism language '.json' for highlighting. applying generic code block
warn unable to find prism language 'grahpql' for highlighting. applying generic code block
warn unable to find prism language 'gatsby-config.js' for highlighting. applying generic code block
warn unable to find prism language 'plain' for highlighting. applying generic code block
success createPages - 25.942s
success createPagesStatefully - 0.252s
success onPreExtractQueries - 0.005s
success update schema - 0.088s
success extract queries from components - 0.752s
success write out requires - 0.032s
success write out redirect data - 0.002s
success Build manifest and related icons - 0.092s
success onPostBootstrap - 0.156s

info bootstrap finished - 158.972 s

success Building production JavaScript and CSS bundles - 27.425s
success Rewriting compilation hashes - 0.003s
success run queries - 276.823s - 3450/3450 12.46/s
success Building static HTML for pages - 33.062s - 3445/3445 104.20/s
info Generated public/sw.js, which will precache 7 files, totaling 1445083 bytes.
The following pages will be precached:
/offline-plugin-app-shell-fallback/index.html
info Done building in 495.69511604 sec
gatsby build --no-uglify  628.87s user 18.73s system 130% cpu 8:16.20 total
Run 3 - createContentDigest
$ time gatsby build --no-uglify
success open and validate gatsby-configs - 0.160s
success load plugins - 0.893s
success onPreInit - 0.006s
success delete html and css files from previous builds - 0.008s
success initialize cache - 0.013s
success copy gatsby files - 0.016s
success onPreBootstrap - 0.009s
success createSchemaCustomization - 0.151s
success source and transform nodes - 116.077s
warn There are conflicting field types in your data.

If you have explicitly defined a type for those fields, you can safely ignore this warning message.
Otherwise, Gatsby will omit those fields from the GraphQL schema.

If you know all field types in advance, the best strategy is to explicitly define them with the `createTypes` action, and skip inference with the `@dontInfer` directive.
See https://www.gatsbyjs.org/docs/actions/#createTypes
NPMPackage.types.ts:
 - type: boolean
   value: false
 - type: string
   value: 'included'
success building schema - 5.263s
warn code block or inline code language not specified in markdown. applying generic code block
warn unable to find prism language 'sh' for highlighting. applying generic code block
warn unable to find prism language 'console' for highlighting. applying generic code block
warn unable to find prism language ' for highlighting. applying generic code block
warn unable to find prism language '.gitignore' for highlighting. applying generic code block
warn unable to find prism language '.json' for highlighting. applying generic code block
warn unable to find prism language 'grahpql' for highlighting. applying generic code block
warn unable to find prism language 'gatsby-config.js' for highlighting. applying generic code block
warn unable to find prism language 'plain' for highlighting. applying generic code block
success createPages - 25.951s
success createPagesStatefully - 0.236s
success onPreExtractQueries - 0.004s
success update schema - 0.085s
success extract queries from components - 0.744s
success write out requires - 0.035s
success write out redirect data - 0.002s
success Build manifest and related icons - 0.093s
success onPostBootstrap - 0.159s

info bootstrap finished - 152.103 s

success Building production JavaScript and CSS bundles - 27.888s
success Rewriting compilation hashes - 0.003s
success run queries - 273.915s - 3450/3450 12.60/s
success Building static HTML for pages - 33.649s - 3445/3445 102.38/s
info Generated public/sw.js, which will precache 7 files, totaling 1445083 bytes.
The following pages will be precached:
/offline-plugin-app-shell-fallback/index.html
info Done building in 483.561198363 sec
gatsby build --no-uglify  622.61s user 19.46s system 132% cpu 8:04.09 total

@pieh
Copy link
Contributor

pieh commented Nov 28, 2019

I think this is good to merge (only snapshot need updating)

@pvdz
Copy link
Contributor

pvdz commented Nov 28, 2019

Did you check for performance differences at scale? Slightly concerned about the perf impact of manually walking and reordering a lot / big objects like this.

I understand the need, totally support it, and just want to make sure it scales properly :)

@wardpeet
Copy link
Contributor Author

I did run .org, I can run on govbook and check for another large site in your issue.

@pvdz
Copy link
Contributor

pvdz commented Nov 28, 2019

A govbook smoke check would be nice :) Thanks!

@pvdz
Copy link
Contributor

pvdz commented Nov 28, 2019

Wait, what kind of objects actually go through this digest generator? Like, how is that data impacted at scale? (Does it require many pages, many queries, many images?) govbook is a particular subset so I just want to make sure it's actually the relevant benchmark to run here :)

@wardpeet
Copy link
Contributor Author

Every node needs a contentDigest and mostly passes down a node object so it's used pretty heavily for anything data related.

@pvdz
Copy link
Contributor

pvdz commented Nov 28, 2019

Then govbook will serve you well here (:

@wardpeet
Copy link
Contributor Author

Running govbook didn't give me any slow down. Both were averaging around 146s with latest gatsby sources

Copy link
Contributor

@sidharthachatterjee sidharthachatterjee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like its good to go in! Thanks as always @wardpeet ❤️

@sidharthachatterjee sidharthachatterjee added the bot: merge on green Gatsbot will merge these PRs automatically when all tests passes label Nov 29, 2019
@gatsbybot gatsbybot merged commit cb6d0e2 into gatsbyjs:master Nov 29, 2019
@vladar
Copy link
Contributor

vladar commented Dec 2, 2019

Published in [email protected] and [email protected]

@wardpeet wardpeet deleted the fix/content-digest-deterministic branch September 23, 2020 18:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bot: merge on green Gatsbot will merge these PRs automatically when all tests passes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants