@google-research-datasets

Google Research Datasets

Datasets released by Google Research

Pinned

  1. natural-questions Public

    Natural Questions (NQ) contains real user questions issued to Google search, and answers found from Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question ans…

    Python 927 152

  2. conceptual-captions Public

    Conceptual Captions is a dataset containing (image-URL, caption) pairs designed for the training and evaluation of machine learned image captioning systems.

    Shell 517 26

  3. Objectron Public

    Objectron is a dataset of short, object-centric video clips. In addition, the videos also contain AR session metadata including camera poses, sparse point-clouds and planes. In each video, the came…

    Jupyter Notebook 2.2k 263

  4. wit Public

    WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

    1k 41

  5. paws Public

    This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase ident…

    Python 548 52

  6. dstc8-schema-guided-dialogue Public

    The Schema-Guided Dialogue Dataset

    Python 543 124

Repositories

Showing 10 of 162 repositories
  • uicrit Public

    UICrit is a dataset containing human-generated natural language design critiques, corresponding bounding boxes for each critique, and design quality ratings for 1,000 mobile UIs from RICO. This dataset was collected for our UIST '24 paper: https://arxiv.org/abs/2407.08850.

    6 1 0 0 Updated Oct 15, 2024
  • tap-typing-with-touch-sensing-images Public

    The Tap Typing with Touch Sensing Images (TSI) dataset contains data of user taps on a mobile touchscreen keyboard, including elliptical features and capacitive sensing images of the taps. The dataset aligns each tap with a key the user intended to type during data collection so it can be used for keyboard decoder training and/or evaluation.

    1 CC-BY-4.0 1 1 0 Updated Oct 15, 2024
  • mittens Public

    Datasets for measuring misgendering in translation

    5 0 0 0 Updated Oct 4, 2024
  • adversarial-nibbler Public

    This dataset contains results from all rounds of Adversarial Nibbler. This data includes adversarial prompts fed into public generative text2image models and validations for unsafe images. There will be two sets of data: all prompts submitted and all prompts attempted (sent to t2i models but not submitted as unsafe).

    20 CC-BY-4.0 3 0 0 Updated Sep 30, 2024
  • wit Public

    WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

    1,002 41 1 0 Updated Sep 27, 2024
  • C4_200M-synthetic-dataset-for-grammatical-error-correction Public

    This dataset contains synthetic training data for grammatical error correction. The corpus is generated by corrupting clean sentences from C4 using a tagged corruption model. The approach and the dataset are described in more detail by Stahlberg and Kumar (2021) (https://www.aclweb.org/anthology/2021.bea-1.4/)

    Python 154 CC-BY-4.0 24 0 0 Updated Sep 24, 2024
  • sanpo_dataset Public
    Python 40 Apache-2.0 1 3 2 Updated Sep 19, 2024
  • SeeGULL-Multilingual Public

    SeeGULL Multilingual is a multilingual and multicultural dataset of stereotypes. It consists of stereotypes in 20 languages with human annotations across 23 languages, including annotations on their degree of offensiveness.

    3 CC-BY-4.0 1 0 0 Updated Sep 19, 2024
  • ToTTo Public

    ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description. We hope it can serve as a useful research benchmark for high-precision conditional text generation.

    436 37 6 0 Updated Sep 11, 2024
  • indic-gen-bench Public

    IndicGenBench is a high-quality, multilingual, multi-way parallel benchmark for evaluating Large Language Models (LLMs) on 4 user-facing generation tasks across a diverse set of 29 Indic languages covering 13 scripts and 4 language families.

    41 6 0 0 Updated Sep 1, 2024
