Google Research Datasets

Datasets released by Google Research

Pinned

natural-questions Public

Natural Questions (NQ) contains real user questions issued to Google search, and answers found from Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question ans…

Python 938 153
conceptual-captions Public

Conceptual Captions is a dataset of (image-URL, caption) pairs designed for training and evaluating machine-learned image captioning systems.

Shell 519 26
Objectron Public

Objectron is a dataset of short, object-centric video clips. In addition, the videos also contain AR session metadata including camera poses, sparse point-clouds and planes. In each video, the came…

Jupyter Notebook 2.2k 263
wit Public

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

1k 41
paws Public

This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase ident…

Python 555 52
dstc8-schema-guided-dialogue Public

The Schema-Guided Dialogue Dataset

Python 548 124

Repositories

hiertext Public
The HierText dataset contains ~12k images from the Open Images dataset v6 with a large number of text entities. We provide word-, line-, and paragraph-level annotations.

Jupyter Notebook 264 CC-BY-SA-4.0 24 2 1 Updated Nov 8, 2024
Education-Dialogue-Dataset Public
Dataset of conversations generated by prompting Gemini Ultra. Each conversation is between a teacher and a student: the teacher is prompted with a specific topic to teach, and the student is prompted with their learning preferences. https://arxiv.org/abs/2405.14655

1 0 0 0 Updated Oct 29, 2024
sanpo_dataset Public

Python 40 Apache-2.0 2 3 2 Updated Oct 28, 2024
GeniL Public
The GeniL dataset is an effort to detect various types of generalization in language. This multilingual dataset covers sentences in EN, FR, ES, PT, AR, HI, BN, MS, and ID and is annotated by native speakers of each language. Each sentence is collected from a public language corpus and contains at least one identity group name and an attribute.

1 CC-BY-4.0 1 0 0 Updated Oct 18, 2024
uicrit Public
UICrit is a dataset containing human-generated natural language design critiques, corresponding bounding boxes for each critique, and design quality ratings for 1,000 mobile UIs from RICO. This dataset was collected for our UIST '24 paper: https://arxiv.org/abs/2407.08850.

6 1 0 0 Updated Oct 15, 2024
tap-typing-with-touch-sensing-images Public
The Tap Typing with Touch Sensing Images (TSI) dataset contains data of user taps on a mobile touchscreen keyboard, including elliptical features and capacitive sensing images of the taps. The dataset aligns each tap with a key the user intended to type during data collection so it can be used for keyboard decoder training and/or evaluation.

1 CC-BY-4.0 1 1 0 Updated Oct 15, 2024
mittens Public
Datasets for measuring misgendering in translation

5 0 0 0 Updated Oct 4, 2024
adversarial-nibbler Public
This dataset contains results from all rounds of Adversarial Nibbler. This data includes adversarial prompts fed into public generative text2image models and validations for unsafe images. There will be two sets of data: all prompts submitted and all prompts attempted (sent to t2i models but not submitted as unsafe).

20 CC-BY-4.0 3 0 0 Updated Sep 30, 2024
wit Public
WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

1,011 41 1 0 Updated Sep 27, 2024
C4_200M-synthetic-dataset-for-grammatical-error-correction Public
This dataset contains synthetic training data for grammatical error correction. The corpus is generated by corrupting clean sentences from C4 using a tagged corruption model. The approach and the dataset are described in more detail by Stahlberg and Kumar (2021) (https://www.aclweb.org/anthology/2021.bea-1.4/)

Python 156 CC-BY-4.0 24 0 0 Updated Sep 24, 2024

Top languages

Python Jupyter Notebook C++ HTML Macaulay2
