Scientific Data

Publicly available, free, online scientific data, largely from university, industry, and government research programs.

353 listings

USAspending.gov

Submitted May 09, 2017 to Scientific Data

Beta.USAspending.gov is the new official source of accessible, searchable and reliable spending data for the U.S. Government.

Treasury released this new version of the USAspending.gov site in accordance with the Digital Accountability and Transparency Act (DATA Act) requirements. The “Beta” site will run concurrently with the previous version of the USAspending.gov website over the summer to minimize disruptions to users' data access and provide more time to add user-centered enhancements. The new Beta.USAspending.gov site tracks agency expenditures and, for the first time, links relevant agency expenditure data with awards distributed by the government.

Affordable Care Act (ACA) 2014-2017 Datasets

Submitted Apr 29, 2017 to Scientific Data

This is the only nationally comprehensive, public dataset that includes information on all ACA compliant plans offered in the individual and small group markets.

Health Insurance Exchange Comparison (known as HIX Compare) includes information on premiums, deductibles and out-of-pocket maximums, as well as cost-sharing requirements for primary care and specialist visits, prescription drugs, emergency room services and inpatient and outpatient visits for all plans across all 50 states and the District of Columbia.

DBpedia

Submitted Apr 26, 2017 to Scientific Data

DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link the different data sets on the Web to Wikipedia data. We hope that this work will make it easier for the huge amount of information in Wikipedia to be used in some new interesting ways. Furthermore, it might inspire new mechanisms for navigating, linking, and improving the encyclopedia itself.
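As a rough illustration of the kind of query the description refers to, the sketch below builds a SPARQL request URL without sending it. The endpoint address, the `format` parameter, the choice of `dbo:City`, and the helper name `build_request_url` are assumptions made for this example, not details given in the listing.

```python
from urllib.parse import urlencode

# Assumption: address of the public DBpedia SPARQL endpoint.
ENDPOINT = "https://dbpedia.org/sparql"

# Illustrative query: English labels of ten resources typed as dbo:City.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?city ?label WHERE {
  ?city a dbo:City ;
        rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
LIMIT 10
"""


def build_request_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Return a GET URL asking the endpoint for JSON results (hypothetical helper)."""
    # Assumption: the endpoint accepts a "format" parameter naming the result type.
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{endpoint}?{urlencode(params)}"


if __name__ == "__main__":
    # Only prints the URL; fetching it is left to the reader.
    print(build_request_url(QUERY))
```

The URL is only constructed here, so the example runs offline; whether the endpoint answers as expected should be checked against DBpedia's own documentation.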

Tags: nlp, machine learning

WordNet: A Lexical Database of English

Submitted Apr 26, 2017 to Scientific Data

WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet's structure makes it a useful tool for computational linguistics and natural language processing.

WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings. However, there are some important distinctions. First, WordNet interlinks not just word forms (strings of letters) but specific senses of words. As a result, words that are found in close proximity to one another in the network are semantically disambiguated. Second, WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus do not follow any explicit pattern other than meaning similarity.
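The two distinctions above (links between senses rather than word forms, and labeled relations) can be pictured with a toy data structure. This is a minimal sketch with made-up entries and hypothetical names (`Synset`, `senses`, `related`); it does not reproduce WordNet's actual file format or any library's API.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """One concept: a set of synonymous word senses plus labeled links."""
    name: str
    lemmas: list[str]
    gloss: str
    # Relation label -> names of target synsets (e.g. "hypernym").
    relations: dict[str, list[str]] = field(default_factory=dict)


# Toy entries invented for illustration; not copied from WordNet.
SYNSETS = {
    "bank.river": Synset("bank.river", ["bank"],
                         "sloping land beside a body of water",
                         {"hypernym": ["slope"]}),
    "bank.finance": Synset("bank.finance", ["bank", "depository"],
                           "an institution that accepts deposits",
                           {"hypernym": ["institution"]}),
    "slope": Synset("slope", ["slope", "incline"], "an inclined surface"),
    "institution": Synset("institution", ["institution"],
                          "an established organization"),
}


def senses(word: str) -> list[str]:
    """Return every synset in which the word form appears (one per sense)."""
    return [s.name for s in SYNSETS.values() if word in s.lemmas]


def related(synset_name: str, label: str) -> list[str]:
    """Follow one labeled relation from a specific sense."""
    return SYNSETS[synset_name].relations.get(label, [])


if __name__ == "__main__":
    print(senses("bank"))
    print(related("bank.finance", "hypernym"))
```

Because relations hang off a sense rather than a string, the two senses of "bank" lead to different neighbors, which is the disambiguation the paragraph describes.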

Tags: nlp

Stanford Natural Language Inference (SNLI) Corpus

Submitted Apr 20, 2017 to Scientific Data

The SNLI corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE). We aim for it to serve both as a benchmark for evaluating representational systems for text, especially including those induced by representation learning methods, as well as a resource for developing NLP models of any kind.
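To make the three-way labeling concrete, here is a small sketch that tallies labels over sentence pairs. It assumes a JSON Lines layout with `sentence1`, `sentence2`, and `gold_label` fields and a `-` label for pairs without a consensus label; those field names and the sample pairs are assumptions for illustration and should be checked against the corpus documentation.

```python
import json
from collections import Counter

# The three labels named in the corpus description.
LABELS = ("entailment", "contradiction", "neutral")

# Made-up pairs in the assumed JSON Lines layout; not taken from the corpus.
SAMPLE = """\
{"sentence1": "A man is playing a guitar.", "sentence2": "A person is making music.", "gold_label": "entailment"}
{"sentence1": "A man is playing a guitar.", "sentence2": "The man is asleep.", "gold_label": "contradiction"}
{"sentence1": "A man is playing a guitar.", "sentence2": "The man is on a stage.", "gold_label": "neutral"}
{"sentence1": "A dog runs.", "sentence2": "An animal moves.", "gold_label": "-"}
"""


def label_counts(lines):
    """Count pairs per label, skipping blank lines and unlabeled pairs."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        label = json.loads(line)["gold_label"]
        if label in LABELS:
            counts[label] += 1
    return counts


if __name__ == "__main__":
    print(label_counts(SAMPLE.splitlines()))
```

Run over a real split, a tally like this is a quick way to check the balance across the three classes that the description mentions.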

Tags: nlp

Quora Question Pairs Dataset

Submitted Apr 20, 2017 (Edited Apr 20, 2017) to Scientific Data

Today, we are excited to announce the first in what we plan to be a series of public dataset releases. Our dataset releases will be oriented around various problems of relevance to Quora and will give researchers in diverse areas such as machine learning, natural language processing, network science, etc. the opportunity to try their hand at some of the challenges that arise in building a scalable online knowledge-sharing platform. Our first dataset is related to the problem of identifying duplicate questions.

Tags: nlp, machine learning

Abstract Meaning Representation Bank

Submitted Apr 17, 2017 to Scientific Data

The AMR Bank is a set of English sentences paired with simple, readable semantic representations. We hope that it will spur new research in natural language understanding, generation, and translation.

The AMR Bank is manually constructed by human annotators at:
- The Linguistic Data Consortium
- SDL
- The University of Colorado's Center for Computational Language and Education Research (CLEAR)
- The University of Southern California's Information Sciences Institute (ISI) and Computational Linguistics at USC.

Tags: nlp

HMDB: A large human motion video database

Submitted Apr 13, 2017 to Scientific Data

With nearly one billion online videos viewed every day, an emerging frontier in computer vision research is recognition and search in video. While much effort has been devoted to the collection and annotation of large-scale static image datasets containing thousands of image categories, human action datasets lag far behind. Here we introduce HMDB, collected from various sources, mostly from movies, with a small proportion from public databases such as the Prelinger archive, YouTube and Google videos. The dataset contains 6849 clips divided into 51 action categories, each containing a minimum of 101 clips.

Tags: computer vision

Interactive Geologic Map of the Grand Canyon

Submitted Apr 10, 2017 to Scientific Data

This work-in-progress represents a spectacular set of data generated by George Billingsley and others at the USGS. The geologic data shown here was taken from the following USGS publications:
- Geologic map of the Mount Trumbull 30' x 60' quadrangle
- Geologic map of the Grand Canyon 30' x 60' quadrangle
- Geologic map of the Valle 30' x 60' quadrangle
- Geologic Map of the Cameron 30' x 60' Quadrangle
- Geologic Map of the Peach Springs 30' x 60' Quadrangle
- Geologic map of the Tuba City 30' x 60' quadrangle

ANC Manually Annotated Sub-Corpus (MASC)

Submitted Apr 07, 2017 (Edited Apr 07, 2017) to Scientific Data

The Manually Annotated Sub-Corpus (MASC) consists of approximately 500,000 words of contemporary American English written and spoken data drawn from the Open American National Corpus (OANC).

All of MASC includes manually validated annotations for sentence boundaries, token, lemma and POS; noun and verb chunks; named entities (person, location, organization, date); Penn Treebank syntax; coreference; and discourse structure. Additional manually produced or validated annotations have been produced by the MASC project for portions of the sub-corpus, including full-text annotation for FrameNet frame elements and a 100K+ sentence corpus with WordNet 3.1 sense tags, of which one-tenth are also annotated for FrameNet frame elements. Annotations of all or portions of the sub-corpus for a wide variety of other linguistic phenomena have been contributed by other projects, including PropBank, TimeBank, Pittsburgh opinion, and several others.

Unlike most freely available corpora including a wide variety of linguistic annotations, MASC contains a balanced selection of texts from a broad range of genres.

MASC is an OPEN LANGUAGE DATA resource that can be downloaded by anyone for any purpose. At the same time, it is a COLLABORATIVE COMMUNITY RESOURCE that will ultimately be sustained by community contributions of annotations and derived data.

Tags: nlp

UMCD Dataset: A UAV Mosaicking and Change Detection Dataset

Submitted Apr 05, 2017 to Scientific Data

The UMCD Dataset (about 3.50 GB) is composed of two main sets of challenging video sequences acquired at very low altitude. The first set consists of 30 non-geo-referenced sequences that can be used only to evaluate mosaicking algorithms. The second set is made up of 10 pairs of geo-referenced sequences (i.e., 20 videos), in which the first sequence of each pair can be used to build the mosaic and the second, acquired on the same path, can be used to test change detection algorithms. The geo-referencing allows developers to drastically reduce the number of matches during the search for entities. The dataset is freely available for research purposes only.

Tags: computer vision

US National Data Center Radionuclide Reports

Submitted Mar 09, 2017 to Scientific Data

The United States operates particle and gas samplers to support the Comprehensive Test Ban Treaty Organization's International Monitoring System. The samplers are designed to collect and measure trace level radioisotopes that might be released into the atmosphere during an underground or atmospheric nuclear test. The reports on this page summarize the collection parameters and the radionuclides detected at the US samplers.

Google AudioSet: A sound vocabulary and dataset

Submitted Mar 08, 2017 to Scientific Data

AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds.

By releasing AudioSet, Google Research hopes to provide a common, realistic-scale evaluation task for audio event detection, as well as a starting point for a comprehensive vocabulary of sound events.
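The hierarchical graph of event categories described above can be walked with a short traversal. The sketch below assumes each ontology entry carries `id`, `name`, and `child_ids` fields; the entries, the `toy/` identifiers, and the `descendants` helper are invented for illustration and are not part of the released ontology.

```python
# Toy ontology entries in the assumed layout; ids and names are made up.
ONTOLOGY = [
    {"id": "toy/animal", "name": "Animal", "child_ids": ["toy/dog", "toy/cat"]},
    {"id": "toy/dog", "name": "Dog", "child_ids": ["toy/bark"]},
    {"id": "toy/cat", "name": "Cat", "child_ids": []},
    {"id": "toy/bark", "name": "Bark", "child_ids": []},
]


def descendants(root_id, ontology):
    """Return names of all categories below root_id, in depth-first order."""
    by_id = {entry["id"]: entry for entry in ontology}
    seen = []
    # Reverse so that children are visited in their listed order.
    stack = list(reversed(by_id[root_id]["child_ids"]))
    while stack:
        node_id = stack.pop()
        if node_id in seen:
            # A graph node may be reachable through several parents.
            continue
        seen.append(node_id)
        stack.extend(reversed(by_id[node_id]["child_ids"]))
    return [by_id[node_id]["name"] for node_id in seen]


if __name__ == "__main__":
    print(descendants("toy/animal", ONTOLOGY))
```

The `seen` check matters because the structure is described as a graph rather than a strict tree, so a category could appear under more than one parent.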

Tags: machine learning

Paleobiology Database (PaleoBioDB)

Submitted Feb 18, 2017 to Scientific Data

The Paleobiology Database (PaleoBioDB) is a non-governmental, non-profit public resource for paleontological data. It has been organized and operated by a multi-disciplinary, multi-institutional, international group of paleobiological researchers. Its purpose is to provide global, collection-based occurrence and taxonomic data for organisms of all geological ages, as well as data services to allow easy access to data for independent development of analytical tools, visualization software, and applications of all types. The Database’s broader goal is to encourage and enable data-driven collaborative efforts that address large-scale paleobiological questions.

NASA Lightning Imaging Sensor Data

Submitted Feb 17, 2017 to Scientific Data

The Lightning Imaging Sensor (LIS) detects total lightning (i.e. cloud-to-cloud, cloud-to-ground, and intra-cloud flashes) from a space-based platform.

The LIS is based on digital imaging technology and built around a 128 x 128 charge-coupled device (CCD) array that is used to extract only the optical emissions of lightning through Earth's atmosphere for both day and night backgrounds.

The data cover the distribution and variability of total lightning, including its amount, rate, and radiant energy during both day and night.

Wikimedia Research Data

Submitted Feb 15, 2017 to Scientific Data

This page is an overview of the various sources of open-licensed data published by the Wikimedia Foundation or about Wikimedia projects. The information is intended to help community members, developers and researchers learn about available data sources and find the data they need for their work. Data types include wiki content data dumps, MediaWiki API, Tool Labs for connecting to shared server resources, WikiStats and analytics, DBpedia structured data, and more.

Virtual Fly Brain

Submitted Feb 15, 2017 to Scientific Data

Virtual Fly Brain (VFB) is an interactive tool for neurobiologists to explore the detailed neuroanatomy, neuron connectivity and gene expression of Drosophila melanogaster. Our goal is to make it easier for researchers to find relevant anatomical information and reagents.

We integrate the neuroanatomical and expression data from the published literature, as well as image datasets onto the same brain template, making it possible to run cross searches, find similar neurons and compare image data on our 3D Brain Viewer.

Fueling the Gold Rush: The Greatest Public Datasets for AI

Submitted Feb 11, 2017 (Edited Feb 11, 2017) to Scientific Data

It has never been easier to build AI or machine learning-based systems than it is today. The ubiquity of cutting-edge open-source tools such as TensorFlow, Torch, and Spark, coupled with the availability of massive amounts of computation power through AWS, Google Cloud, or other cloud providers, means that you can train cutting-edge models from your laptop over an afternoon coffee.

This week, a few machine learning experts and I were talking about all this. To make your life easier, we’ve collected an (opinionated) list of some open datasets that you can’t afford not to know about in the AI world.

Tags: ai, machine learning

USGS ScienceBase

Submitted Feb 07, 2017 to Scientific Data

ScienceBase is an open-source project that provides data cataloging, data search and discovery, web services and research community catalogs. Current documentation covers its structure, information model, services, directory and repository. The wiki provides guidance for using services to interact with the Science API, including JSON examples. Links to examples showing use of ScienceBase services are also provided.

Radio Garden

Submitted Feb 01, 2017 to Scientific Data

Radio Garden is an online app that allows you to find and listen to live radio stations around the world by clicking points on a 3D globe.

Radio Garden incorporates results from the international research project Transnational Radio Encounters, directed by Golo Föllmer at Martin-Luther University Halle in co-operation with the Universities of Copenhagen and Aarhus in Denmark, London Metropolitan and the University of Sunderland in the UK, and Utrecht University in the Netherlands. The project was funded by HERA (Humanities in the European Research Area) from 2013 to 2016.
