🇪🇺 💬 EuroInstructProject

This project generates instruction datasets from existing German, English and other European open source datasets.

Fine-tuning on these datasets is intended to add instruction-following capabilities to GPT models. To learn more, please see our documentation page.

Instruct GermanDPR Dataset v1 (German)

This dataset is derived from the GermanDPR dataset from deepset.ai. Many thanks to you!

To learn more about the base dataset, see the arXiv paper: GermanQuAD and GermanDPR: Improving Non-English Question Answering and Passage Retrieval

Each record has three columns: input, output and uuid.

The input of each record always consists of a text block and a question about that text block. If the question can be answered from the text block, the output contains exactly the answering text passage. If the answer is not contained in the text block, the output says so. The numbers of positive and negative responses are exactly balanced. The uuid column contains a UUID generated with uuid.uuid4().
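The record layout described above can be illustrated with a minimal sketch. The function name `build_qa_record`, the prompt wording and the text in `NO_ANSWER_TEXT` are assumptions made for illustration; the actual templates are defined in InstructGermanDPR_v1.ipynb and may differ.

```python
import uuid

# Hypothetical wording: the project's actual prompt and "no answer" texts
# are not stated in this README.
NO_ANSWER_TEXT = "The answer is not contained in the text."


def build_qa_record(text_block, question, answer=None):
    """Build one record with the columns input, output and uuid."""
    # The input combines the text block and the question about it.
    input_text = f"{text_block}\n\nQuestion: {question}"
    # Positive case: the output is the answering passage.
    # Negative case: the output states that the text holds no answer.
    output_text = answer if answer is not None else NO_ANSWER_TEXT
    return {
        "input": input_text,
        "output": output_text,
        "uuid": str(uuid.uuid4()),
    }


positive = build_qa_record(
    "Berlin ist die Hauptstadt von Deutschland.",
    "Was ist die Hauptstadt von Deutschland?",
    answer="Berlin",
)
negative = build_qa_record(
    "Berlin ist die Hauptstadt von Deutschland.",
    "Wie viele Einwohner hat Hamburg?",
)
```

Here `positive` stands for an answerable question and `negative` for one whose answer is missing from the text block.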

InstructGermanDPR_v1.ipynb: script to generate this dataset
instruct_data/InstructGermanDPR_v1_train.csv: 18,541 train pairs of input and output
instruct_data/InstructGermanDPR_v1_test.csv: 2,050 test pairs of input and output

The dataset can be loaded with:

import pandas as pd
df_train = pd.read_csv("./instruct_data/InstructGermanDPR_v1_train.csv")
df_test = pd.read_csv("./instruct_data/InstructGermanDPR_v1_test.csv")
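
After loading, a quick sanity check of the rows may be useful. The following is a sketch only: the helper `validate_records` is hypothetical and not part of this repository. It checks that every row has the expected columns and that each uuid value parses and is unique.

```python
import uuid


def validate_records(records, expected_columns):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    seen = set()
    for index, record in enumerate(records):
        # Every expected column must be present in the row.
        missing = [c for c in expected_columns if c not in record]
        if missing:
            problems.append(f"row {index}: missing columns {missing}")
            continue
        # The uuid value must parse as a UUID.
        try:
            parsed = uuid.UUID(str(record.get("uuid")))
        except ValueError:
            problems.append(f"row {index}: invalid uuid")
            continue
        # The same uuid must not appear twice.
        if parsed in seen:
            problems.append(f"row {index}: duplicate uuid")
        seen.add(parsed)
    return problems


# Small in-memory sample instead of the real CSV files.
sample = [
    {"input": "...", "output": "...", "uuid": str(uuid.uuid4())},
    {"input": "...", "output": "...", "uuid": "not-a-uuid"},
]
problems = validate_records(sample, ["input", "output", "uuid"])
```

With pandas, the rows of a loaded frame can be passed in as `df.to_dict("records")`.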

Instruct OPUS Tatoeba v1 (English and German)

This dataset is derived from the Tatoeba v2022-03-03 dataset. The prompt asks for a translation.

Each record has five columns: input, output, src_lang, target_lang and uuid.

The input provides a German or English text and asks, in German or English, for it to be translated. The output contains exactly (and only) the translated text. src_lang and target_lang contain the source language and the target language of the translation. The uuid column contains a UUID generated with uuid.uuid4().
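
The five-column layout can be sketched as follows. The instruction templates, the language codes "de" and "en", and the function name `build_translation_record` are assumptions for illustration; the actual prompts and column values come from Instruct_OPUS_Tatoeba_v1.ipynb and may differ.

```python
import uuid

# Hypothetical instruction templates, keyed by the language of the instruction.
TEMPLATES = {
    "en": "Translate the following text {target}: {text}",
    "de": "Übersetze den folgenden Text {target}: {text}",
}
# Hypothetical phrases naming the target language in each instruction language.
TARGET_PHRASES = {
    "en": {"de": "to German", "en": "to English"},
    "de": {"de": "ins Deutsche", "en": "ins Englische"},
}


def build_translation_record(text, translation, src_lang, target_lang, prompt_lang):
    """Build one record with input, output, src_lang, target_lang and uuid."""
    target = TARGET_PHRASES[prompt_lang][target_lang]
    return {
        # The input holds the instruction together with the source text.
        "input": TEMPLATES[prompt_lang].format(target=target, text=text),
        # The output holds only the translated text.
        "output": translation,
        "src_lang": src_lang,
        "target_lang": target_lang,
        "uuid": str(uuid.uuid4()),
    }


record = build_translation_record(
    "Guten Morgen!", "Good morning!", "de", "en", prompt_lang="en"
)
```

The `prompt_lang` argument reflects that the instruction itself may be written in either German or English, independent of the translation direction.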

Instruct_OPUS_Tatoeba_v1.ipynb: script to generate this dataset
instruct_data/Instruct_OPUS_Tatoeba_v2022_03_03_de_en_v1.csv.gz: 626,254 pairs of input and output

The dataset can be loaded with:

df = pd.read_csv("./instruct_data/Instruct_OPUS_Tatoeba_v2022_03_03_de_en_v1.csv.gz")

Licensing (Datasets)

The code and documentation of this project are under MIT license. The licenses of the datasets are stated below.

Instruct GermanDPR

InstructGermanDPR is licensed under CC BY 4.0. It is derived from the GermanDPR dataset from deepset.ai. Many thanks to you!

Changes to the original data were made with InstructGermanDPR_v1.ipynb. Please also see the arXiv paper: GermanQuAD and GermanDPR: Improving Non-English Question Answering and Passage Retrieval

Tatoeba v2022-03-03

Tatoeba v2022-03-03 is licensed under CC BY 2.0 FR.

citation: J. Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)
copyright: https://tatoeba.org/eng/terms_of_use

Licensing (Code and Documentation)

Licensed under the MIT License (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License by reviewing the file LICENSE in the repository.
