textcat training is not deterministic with gpu enabled #6373

Closed · wlwg opened this issue Nov 10, 2020 · 8 comments · Fixed by #6411
Labels
bug (Bugs and behaviour differing from documentation) · feat / textcat (Feature: Text Classifier) · gpu (Using spaCy on GPU) · reproducibility (Consistency, reproducibility, determinism, and randomness) · training (Training and updating models)

Comments

@wlwg commented Nov 10, 2020

How to reproduce the behaviour

This is related to #6177. I can verify that when using the CPU, the training losses/weights for textcat are deterministic with fix_random_seed. However, if I enable the GPU via spacy.require_gpu(), the training losses/weights are different every time.

import spacy
spacy.require_gpu()

for _ in range(2):
    spacy.util.fix_random_seed(0)

    model = spacy.load('en_core_web_sm')

    model.add_pipe(model.create_pipe('textcat'))
    model.remove_pipe('parser')
    model.remove_pipe('tagger')

    cat = model.get_pipe('textcat')
    cat.add_label("dog")
    cat.add_label("donut")

    model.begin_training()
    print(model("What even is?").cats)

Output:

{'dog': 0.2501096725463867, 'donut': 0.3427947163581848}
{'dog': 0.9567031860351562, 'donut': 0.9506585001945496}
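
For context, fix_random_seed is generally expected to seed the Python, NumPy and (when running on GPU) CuPy generators. Below is a minimal sketch of that kind of helper, assuming cupy may or may not be installed; it is not spaCy's exact implementation.

import random

import numpy


def fix_random_seed(seed=0):
    # Seed the CPU-side generators.
    random.seed(seed)
    numpy.random.seed(seed)
    # Seed the GPU generator too, if CuPy is available.
    try:
        import cupy
        cupy.random.seed(seed)
    except ImportError:
        pass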

Your Environment

  • Operating System: Linux
  • Python Version Used: 3.6.9
  • spaCy Version Used: latest on master (git sha: 320a8b1)
  • Environment Information: Google Colab
@adrianeboyd added the feat / textcat (Feature: Text Classifier), gpu (Using spaCy on GPU), and training (Training and updating models) labels Nov 11, 2020
@adrianeboyd (Contributor)

Hmm, I can't reproduce this.

Can you double-check by explicitly uninstalling spacy in colab before installing from master? It's possible that the default spacy install isn't being replaced/uninstalled cleanly when you install from source.

What do you see in spacy.git_info.GIT_VERSION?

@adrianeboyd added the more-info-needed (This issue needs more information) label Nov 11, 2020
@svlandeg (Member)

And what is your thinc version?

@wlwg (Author) commented Nov 13, 2020

@adrianeboyd @svlandeg
spacy.__version__: 2.3.2
spacy.git_info.GIT_VERSION: 320a8b148
thinc: 7.4.1

I just wrote up a more detailed script: https://colab.research.google.com/drive/1lVJpVE-SS85jQP3LdkuZkhKvpBA0EuXM?usp=sharing

The no-response bot removed the more-info-needed (This issue needs more information) label Nov 13, 2020
@adrianeboyd (Contributor)

Hmm, I do think there may be a bug of some sort here in spaCy v2. Locally and with the colab example above I get consistent results across multiple CPU runs and across multiple GPU runs (also with our quick internal test cases related to this), but the CPU and GPU results are not similar to each other, and if I extend the training a bit I do get different results across multiple GPU runs. We will look into it!

In better news, with spacy v3 I get the same results on both (minus some float rounding differences, of course).
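
For reference, a roughly equivalent repro under the spaCy v3 API (a sketch, not something posted in this thread) would look like this:

import spacy
from spacy.util import fix_random_seed

spacy.require_gpu()

for _ in range(2):
    fix_random_seed(0)

    # In v3, pipes are added by their registered string name.
    nlp = spacy.blank("en")
    textcat = nlp.add_pipe("textcat")
    textcat.add_label("dog")
    textcat.add_label("donut")

    # initialize() replaces v2's begin_training().
    nlp.initialize()
    print(nlp("What even is?").cats)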

@adrianeboyd added the bug (Bugs and behaviour differing from documentation) label Nov 16, 2020
@svlandeg (Member)

I'd be happy to look into this further, but I can't reproduce... :(

If I run this on either CPU or GPU, I just keep getting consistent results, after installing a clean copy of spacy[cuda101]. I can run the training loop 200 times and keep getting the same result.

The only thing I can think of right now is that this happens on Linux and not Windows? Though that makes little sense to me. @adrianeboyd: you couldn't replicate at first either - what exactly did you change to replicate this?

@adrianeboyd (Contributor)

Here's my test script (just adapted a bit from the one in the colab example):

import spacy
from spacy.util import minibatch, compounding

def train():
    spacy.util.fix_random_seed(0)
    model = spacy.blank("en")

    model.add_pipe(model.create_pipe("textcat"))

    cat = model.get_pipe("textcat")
    cat.add_label("dog")
    cat.add_label("donut")

    # Synthetic data: label scores interpolate linearly between the two categories.
    x_train = [f"example {i}" for i in range(1000)]
    y_train = [{"cats": {"dog": i/1000, "donut": 1 - i/1000}} for i in range(1000)]
    train_data = list(zip(x_train, y_train))

    optimizer = model.begin_training()
    for i in range(10):
        batches = minibatch(train_data, size=compounding(16, 64, 1.001))
        losses = {}
        for batch in batches:
            x_batch, y_batch = zip(*batch)
            # drop=0 removes dropout as a source of randomness.
            model.update(x_batch, y_batch, sgd=optimizer, drop=0, losses=losses)
        print(i, "loss:", losses["textcat"])
    print("example 10:", model("example 10").cats)
    print()

if __name__ == "__main__":
    print("1st time CPU:")
    train()
    print("2nd time CPU:")
    train()
    print("\nEnabling GPU\n")
    spacy.require_gpu()
    print("1st time GPU:")
    train()
    print("2nd time GPU:")
    train()

Output:

1st time CPU:
0 loss: 0.020526510332956605
1 loss: 0.2192715626588324
2 loss: 0.1541586974939264
3 loss: 0.21435572720838536
4 loss: 0.1982542650088135
5 loss: 0.19825033005452042
6 loss: 0.19787737677813766
7 loss: 0.016827800470196053
8 loss: 0.02887996903154999
9 loss: 0.02469563187116819
example 10: {'dog': 0.001906172838062048, 'donut': 0.6181842684745789}

2nd time CPU:
0 loss: 0.020526510332956605
1 loss: 0.2192715626588324
2 loss: 0.1541586974939264
3 loss: 0.21435572720838536
4 loss: 0.1982542650088135
5 loss: 0.19825033005452042
6 loss: 0.19787737677813766
7 loss: 0.016827800470196053
8 loss: 0.02887996903154999
9 loss: 0.02469563187116819
example 10: {'dog': 0.001906172838062048, 'donut': 0.6181842684745789}


Enabling GPU

1st time GPU:
0 loss: 0.022869700213050237
1 loss: 0.06781688092814875
2 loss: 0.15603950362856267
3 loss: 0.029185388615587726
4 loss: 0.04577569641696755
5 loss: 0.03271988184133079
6 loss: 0.030841199260066787
7 loss: 0.016764739026257303
8 loss: 0.023379557263069728
9 loss: 0.020565684088069247
example 10: {'dog': 0.15584374964237213, 'donut': 0.9999545812606812}

2nd time GPU:
0 loss: 0.022846033180030645
1 loss: 0.07457155887192357
2 loss: 0.1533858735638205
3 loss: 0.03846120528942265
4 loss: 0.030317590604681754
5 loss: 0.022946339027839713
6 loss: 0.040068494405659294
7 loss: 0.023592466532136314
8 loss: 0.02665060829349386
9 loss: 0.021907005400862545
example 10: {'dog': 0.15843163430690765, 'donut': 0.9288136959075928}

I tested in a new venv with everything from wheels except spacy (from master as of now). "example 10" is the model's cats output for the text "example 10".

example 10 for a few more GPU runs:

{'dog': 0.2435295134782791, 'donut': 0.9999375343322754}
{'dog': 0.4791581332683563, 'donut': 0.9981231093406677}
{'dog': 0.6463608145713806, 'donut': 0.016409972682595253}
{'dog': 0.14756248891353607, 'donut': 0.9230985045433044}

pip freeze: freeze.txt

I redid the test with v3 and the results are a bit more variable than I thought between CPU and GPU, but they're not that different across GPU runs.

CPU: {'dog': 0.0654868334531784, 'donut': 0.9892733693122864}
GPU 1: {'dog': 0.022449197247624397, 'donut': 0.9723042249679565}
GPU 2: {'dog': 0.02237524650990963, 'donut': 0.9726961255073547}
GPU 3: {'dog': 0.022426428273320198, 'donut': 0.9722701907157898}
GPU 4: {'dog': 0.02197781391441822, 'donut': 0.9722147583961487}

@svlandeg (Member) commented Nov 19, 2020

Thanks Adriane - the original script didn't include a model.update call, which is what prevented me from reproducing this.

I was finally able to track this down to the ParametricAttention layer of the CNN model in the default textcat architecture. PR #6411 should fix this - but it requires an update of Thinc to 7.4.3 (to be released).
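
Until that update is out, one possible workaround (a sketch based on spaCy v2's documented textcat config options, not something verified in this thread) is to select the bow architecture, which avoids the CNN and its ParametricAttention layer:

import spacy

spacy.require_gpu()
spacy.util.fix_random_seed(0)

nlp = spacy.blank("en")
# The "bow" architecture is a bag-of-words model that skips the CNN
# (and therefore the ParametricAttention layer) entirely.
textcat = nlp.create_pipe(
    "textcat",
    config={"architecture": "bow", "exclusive_classes": False},
)
nlp.add_pipe(textcat)
textcat.add_label("dog")
textcat.add_label("donut")
nlp.begin_training()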

@github-actions (bot)

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

The github-actions bot locked this conversation as resolved and limited it to collaborators Oct 29, 2021
@polm added the reproducibility (Consistency, reproducibility, determinism, and randomness) label Nov 22, 2022