Add doc tests for Albert and Bigbird #16774
Conversation
Co-authored-by: Yih-Dar <[email protected]>
@ydshieh Could you please take a look at it? I think we still have a problem with the Albert tokenizer:

from transformers import AlbertTokenizer, AlbertForMaskedLM
import torch

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertForMaskedLM.from_pretrained("albert-base-v2")

input_text = "The capital of France is [MASK]."
target_text = "The capital of France is Paris."

tokenizer.tokenize(input_text)
# ['▁the', '▁capital', '▁of', '▁france', '▁is', ' [MASK]', '▁', '.']
tokenizer.tokenize(target_text)
# ['▁the', '▁capital', '▁of', '▁france', '▁is', '▁paris', '.']

The tokenized input has 8 tokens while the tokenized target has 7, so the two tensors end up with different shapes.
The documentation is not available anymore as the PR was closed or merged.
Hi @vumichien, I won't be available for the next few days. I will check when I am back, or my colleague could check this PR :-) Regarding the Albert tokenizer, do you encounter any runtime error due to the shape issue? I understand that the shapes are different, and I had a short discussion with the team, but we thought it should still work. Sorry for not responding to this part earlier; if you see errors caused by these shapes, could you post them here, please?
@ydshieh When I run the doc test for modeling_albert.py locally, the error shows up as follows (sorry for the very long error log).
The error log is the same when I run the doc test for modeling_tf_albert.py.
Maybe a quick and easy way is just to overwrite the examples for AlbertForMaskedLM in the model files, similar to #16565 (comment). But that case is reversed: the masked input has fewer tokens, so you need some different operations. Let's wait for @patrickvonplaten to see if he has a better suggestion.
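For reference, a minimal sketch of what such an overwritten example could look like, sidestepping the shape mismatch by predicting only the token at the mask position rather than comparing full sequences (the input sentence and the 'france' output below are illustrative, not the PR's final text):

>>> from transformers import AlbertTokenizer, AlbertForMaskedLM
>>> import torch

>>> tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
>>> model = AlbertForMaskedLM.from_pretrained("albert-base-v2")

>>> inputs = tokenizer("The capital of [MASK] is Paris.", return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> # retrieve the index of [MASK] and predict only that token
>>> mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
>>> predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)
>>> tokenizer.decode(predicted_token_id)  # illustrative output
'france'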
Good for me once the Albert example for masked language modeling is fixed. Thanks!
@@ -2397,6 +2397,8 @@ def set_output_embeddings(self, new_embeddings):
     checkpoint=_CHECKPOINT_FOR_DOC,
     output_type=MaskedLMOutput,
     config_class=_CONFIG_FOR_DOC,
+    expected_output="'here'",
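For context: expected_output is interpolated into the code sample that add_code_sample_docstrings generates, and becomes the output line the doctest runner checks. The tail of the generated masked-LM sample ends up looking roughly like this (a sketch, not the literal template):

>>> predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)
>>> tokenizer.decode(predicted_token_id)
'here'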
Sorry to comment so late here. Could we maybe overwrite the BigBird example as well? https://huggingface.co/google/bigbird-roberta-base has quite a significant number of downloads and it's known to be a long-range model. Could we maybe provide a long input to be masked here, and do the same for all other examples below?
Would be great if we could overwrite the example docstring here @ydshieh
I think it's a great idea. Let me prepare better examples for Bigbird.
@vumichien @ydshieh, I'd be in favor of overwriting both Albert (so that MLM is correct) and BigBird (to show that it's long-range). What do you think?
@ydshieh @patrickvonplaten I have overwritten both doctest examples, for Albert and Bigbird. What do you think about them?
>>> answer_end_index = outputs.end_logits.argmax()
>>> predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
>>> tokenizer.decode(predict_answer_tokens)
'Old College'
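For context, this fragment is the tail of a question-answering doctest; the lines leading up to it presumably resemble the following sketch (the checkpoint, question, and context here are placeholders, not the PR's exact example):

>>> from transformers import BigBirdTokenizer, BigBirdForQuestionAnswering
>>> import torch

>>> tokenizer = BigBirdTokenizer.from_pretrained("google/bigbird-roberta-base")  # placeholder checkpoint
>>> model = BigBirdForQuestionAnswering.from_pretrained("google/bigbird-roberta-base")

>>> question = "Where did the university hold its first classes?"  # placeholder
>>> context = "The university held its first classes at Old College."  # placeholder
>>> inputs = tokenizer(question, context, return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> answer_start_index = outputs.start_logits.argmax()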
Very cool example!
That's great! The classification and QA examples could be made even longer for BigBird :-) The examples already look great, though. Happy to merge as is as well :-)
Co-authored-by: Patrick von Platen <[email protected]>
I have changed to longer examples for the doctest. The examples are quite long, but in my opinion they are good for showing that BigBird is a long-range model.
Can we put that text in some dataset instead? The documentation will become a bit unreadable with such a long text, whereas we could just load a dataset in one line and take the first sample.
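That is, something along these lines (a sketch; later commits settled on squad_v2, and the index is whichever sample the example picks):

>>> from datasets import load_dataset

>>> squad_ds = load_dataset("squad_v2", split="train")  # doctest: +IGNORE_RESULT
>>> LONG_ARTICLE = squad_ds[0]["context"]  # take the first sample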
@sgugger Thank you for your suggestion. I have changed the examples to use the SQuAD dataset. What do you think about that?
Way better, and great that you're showing the shape! Good for me if @patrickvonplaten is okay.
As with your previous PR, very high quality!
Thank you @vumichien for the effort to overwrite the doctest code 💯
I left 2 tiny suggestions, but no need to feel obliged to apply them.
Ran locally -> all tests passed!
>>> LONG_ARTICLE_TARGET = squad_ds[81514]["context"]
>>> # add mask_token
>>> LONG_ARTICLE_TO_MASK = LONG_ARTICLE_TARGET.replace("maximum", "[MASK]")
Just a tiny nit: could we show a few words around the target word "maximum"? Just for the readers to be able to see the context and confirm the output indeed makes sense :)
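A small sketch of what this nit asks for, continuing the snippet above: slice a window of the article around the target word so the reader sees its context (the window width is arbitrary, and the output is skipped because it depends on the article):

>>> # show a few words around the target word "maximum"
>>> start = LONG_ARTICLE_TARGET.find("maximum")
>>> LONG_ARTICLE_TARGET[max(start - 40, 0) : start + 50]  # doctest: +SKIP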
I will revise as you suggest.
@@ -2858,9 +2910,12 @@ def __init__(self, config):
     @add_start_docstrings_to_model_forward(BIG_BIRD_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         processor_class=_TOKENIZER_FOR_DOC,
-        checkpoint=_CHECKPOINT_FOR_DOC,
+        checkpoint="vumichien/token-classification-bigbird-roberta-base",
Wow, you have trained a token classification model ..? 💯
Sorry, it's just random weights... (but maybe I will try to train BigBird for token classification in the near future 😅). The reason I didn't use the model from hf-internal-testing (hf-internal-testing/tiny-random-bigbird_pegasus) is that it also has random weights, but its output is too long. If you think this is not a good approach, I will revise to use the hf-internal-testing/tiny-random-bigbird_pegasus checkpoint.
It's OK, it is completely fine. However, since vumichien/token-classification-bigbird-roberta-base has random weights, it would be a good idea to use a name like vumichien/token-classification-bigbird-roberta-base-random. This way, doc readers and Hub users won't be confused 😄
Ah, I see. I will change the checkpoint name.
>>> squad_ds = load_dataset("squad_v2", split="train")  # doctest: +IGNORE_RESULT

>>> LONG_ARTICLE = squad_ds[81514]["context"]
>>> QUESTION = squad_ds[81514]["question"]
Nit: it would be good to show the question text as a comment.
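That is, echoing the question right after selecting it, roughly like this (the quoted string is illustrative of what that SQuAD sample might contain, not a verified output):

>>> QUESTION = squad_ds[81514]["question"]
>>> QUESTION  # doctest: +SKIP
'During daytime how high can the temperatures reach?'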
I will revise as you suggest.
@ydshieh I have revised as you suggested. Please let me know if I need to revise anything else.
Love it! Thank you.
>>> tokenizer = BigBirdTokenizer.from_pretrained("l-yohai/bigbird-roberta-base-mnli")
>>> model = BigBirdForSequenceClassification.from_pretrained("l-yohai/bigbird-roberta-base-mnli")
>>> squad_ds = load_dataset("squad_v2", split="train")  # doctest: +IGNORE_RESULT
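Presumably the example then feeds one long SQuAD context through the classifier, along these lines (a sketch; the printed shape and label are assumptions):

>>> import torch

>>> LONG_ARTICLE = squad_ds[81514]["context"]
>>> inputs = tokenizer(LONG_ARTICLE, return_tensors="pt")
>>> # the tokenized article is far longer than the 512 tokens most models accept
>>> list(inputs["input_ids"].shape)  # doctest: +SKIP
[1, 919]
>>> with torch.no_grad():
...     outputs = model(**inputs)
>>> predicted_class_id = int(outputs.logits.argmax())
>>> model.config.id2label[predicted_class_id]  # doctest: +SKIP
'LABEL_0'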
That's great!
Amazing job @vumichien - thanks a mille for making the example so nice :-)
Merged 🚀 Thanks again!
* Add doctest BERT
* make fixup
* fix typo
* change checkpoints
* make fixup
* define doctest output value, update doctest for mobilebert
* solve fix-copies
* update QA target start index and end index
* change checkpoint for docs and reuse defined variable
* Update src/transformers/models/bert/modeling_tf_bert.py
Co-authored-by: Yih-Dar <[email protected]>
* Apply suggestions from code review
Co-authored-by: Yih-Dar <[email protected]>
* Apply suggestions from code review
Co-authored-by: Yih-Dar <[email protected]>
* make fixup
* Add Doctest for Albert and Bigbird
* make fixup
* overwrite examples for Albert and Bigbird
* Apply suggestions from code review
Co-authored-by: Patrick von Platen <[email protected]>
* update longer examples for Bigbird
* using examples from squad_v2
* print out example text
* change name token-classification-big-bird checkpoint to random

Co-authored-by: Yih-Dar <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
What does this PR do?
Add doc tests for Albert and Bigbird, as part of issue #16292.

Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.

Who can review?
@patrickvonplaten, @ydshieh
Documentation: @sgugger