Scientific Data
Publicly available, free, online scientific data, largely from university, industry, and government research programs.
353 listings
Submitted May 09, 2017 to Scientific Data Beta.USAspending.gov is the new official source of accessible, searchable and reliable spending data for the U.S. Government.
Treasury released this new version of the USAspending.gov site in accordance with the Digital Accountability and Transparency Act (DATA Act) requirements. The “Beta” site will run concurrently with the previous version of the USAspending.gov website over the summer to minimize disruptions to users' data access and provide more time to add user-centered enhancements. The new Beta.USAspending.gov site tracks agency expenditures and, for the first time, links relevant agency expenditure data with awards distributed by the government. |
Submitted Apr 29, 2017 to Scientific Data This is the only nationally comprehensive, public dataset that includes information on all ACA-compliant plans offered in the individual and small group markets.
Health Insurance Exchange Comparison (known as HIX Compare) includes information on premiums, deductibles and out-of-pocket maximums, as well as cost-sharing requirements for primary care and specialist visits, prescription drugs, emergency room services and inpatient and outpatient visits for all plans across all 50 states and the District of Columbia. |
Submitted Apr 26, 2017 to Scientific Data DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link the different data sets on the Web to Wikipedia data. We hope that this work will make it easier for the huge amount of information in Wikipedia to be used in some new interesting ways. Furthermore, it might inspire new mechanisms for navigating, linking, and improving the encyclopedia itself.
|
Submitted Apr 26, 2017 to Scientific Data WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet's structure makes it a useful tool for computational linguistics and natural language processing.
WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings. However, there are some important distinctions. First, WordNet interlinks not just word forms (strings of letters) but specific senses of words. As a result, words that are found in close proximity to one another in the network are semantically disambiguated. Second, WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus do not follow any explicit pattern other than meaning similarity. |
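As a rough illustration of the two distinctions above (links between word senses rather than word forms, and labeled relations), here is a minimal Python sketch. It does not use WordNet's data or any WordNet API: the `Synset` class, the identifiers, and the four toy entries are all hypothetical and invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """One concept: a set of synonymous word senses plus labeled links."""
    name: str       # hypothetical identifier, not a real WordNet key
    lemmas: tuple   # word forms that express this concept
    relations: dict = field(default_factory=dict)  # label -> list of synset names


# Toy entries invented for illustration only; not taken from WordNet.
SYNSETS = {
    s.name: s
    for s in [
        Synset("bank.river", ("bank", "riverbank"), {"hypernym": ["slope"]}),
        Synset("bank.finance", ("bank", "depository"), {"hypernym": ["institution"]}),
        Synset("slope", ("slope", "incline")),
        Synset("institution", ("institution", "establishment")),
    ]
}


def senses(word):
    """Return every synset that contains the given word form."""
    return [s for s in SYNSETS.values() if word in s.lemmas]


def related(name, label):
    """Follow one labeled relation from a synset to its target synsets."""
    return [SYNSETS[n] for n in SYNSETS[name].relations.get(label, [])]
```

In this sketch the word form "bank" maps to two separate synsets, so a link from one of them is already disambiguated, and each link carries an explicit label such as `"hypernym"` instead of an unlabeled similarity grouping.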
Submitted Apr 20, 2017 to Scientific Data The SNLI corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE). We aim for it to serve both as a benchmark for evaluating representational systems for text, especially including those induced by representation learning methods, as well as a resource for developing NLP models of any kind.
|
Submitted Apr 20, 2017 (Edited Apr 20, 2017) to Scientific Data Today, we are excited to announce the first in what we plan to be a series of public dataset releases. Our dataset releases will be oriented around various problems of relevance to Quora and will give researchers in diverse areas such as machine learning, natural language processing, network science, etc. the opportunity to try their hand at some of the challenges that arise in building a scalable online knowledge-sharing platform. Our first dataset is related to the problem of identifying duplicate questions.
|
Submitted Apr 17, 2017 to Scientific Data The AMR Bank is a set of English sentences paired with simple, readable semantic representations. We hope that it will spur new research in natural language understanding, generation, and translation.
The AMR Bank is manually constructed by human annotators at:
- The Linguistic Data Consortium
- SDL
- The University of Colorado's Center for Computational Language and Education Research (CLEAR)
- The University of Southern California's Information Sciences Institute (ISI) and Computational Linguistics at USC.
|
Submitted Apr 13, 2017 to Scientific Data With nearly one billion online videos viewed every day, an emerging new frontier in computer vision research is recognition and search in video. While much effort has been devoted to the collection and annotation of large scalable static image datasets containing thousands of image categories, human action datasets lag far behind. Here we introduce HMDB, collected from various sources, mostly from movies, and a small proportion from public databases such as the Prelinger archive, YouTube and Google videos. The dataset contains 6849 clips divided into 51 action categories, each containing a minimum of 101 clips.
|
Submitted Apr 10, 2017 to Scientific Data This work-in-progress represents a spectacular set of data generated by George Billingsley and others at the USGS. The geologic data shown here was taken from the following USGS publications:
- Geologic map of the Mount Trumbull 30' x 60' quadrangle
- Geologic map of the Grand Canyon 30' x 60' quadrangle
- Geologic map of the Valle 30' x 60' quadrangle
- Geologic Map of the Cameron 30' x 60' Quadrangle
- Geologic Map of the Peach Springs 30' x 60' Quadrangle
- Geologic map of the Tuba City 30' x 60' quadrangle
|
Submitted Apr 07, 2017 (Edited Apr 07, 2017) to Scientific Data The Manually Annotated Sub-Corpus (MASC) consists of approximately 500,000 words of contemporary American English written and spoken data drawn from the Open American National Corpus (OANC).
All of MASC includes manually validated annotations for sentence boundaries, token, lemma and POS; noun and verb chunks; named entities (person, location, organization, date); Penn Treebank syntax; coreference; and discourse structure. Additional manually produced or validated annotations have been produced by the MASC project for portions of the sub-corpus, including full-text annotation for FrameNet frame elements and a 100K+ sentence corpus with WordNet 3.1 sense tags, of which one-tenth are also annotated for FrameNet frame elements. Annotations of all or portions of the sub-corpus for a wide variety of other linguistic phenomena have been contributed by other projects, including PropBank, TimeBank, Pittsburgh opinion, and several others. Unlike most freely available corpora including a wide variety of linguistic annotations, MASC contains a balanced selection of texts from a broad range of genres. MASC is an OPEN LANGUAGE DATA resource that can be downloaded by anyone for any purpose. At the same time, it is a COLLABORATIVE COMMUNITY RESOURCE that will ultimately be sustained by community contributions of annotations and derived data. |
Submitted Apr 05, 2017 to Scientific Data The UMCD Dataset (about 3.50GB) is composed of two main sets of challenging video sequences acquired at very low altitude. The first set consists of 30 non-geo-referenced sequences that can be used only to evaluate mosaicking algorithms. The second set is made up of 10 pairs of geo-referenced sequences (i.e., 20 videos) in which the first can be used to build the mosaic and the second, acquired on the same path, can be used to test change detection algorithms. The geo-referencing allows developers to drastically reduce the number of matches during the search for entities. The dataset is freely available only for research purposes.
|
Submitted Mar 09, 2017 to Scientific Data The United States operates particle and gas samplers to support the Comprehensive Test Ban Treaty Organization's International Monitoring System. The samplers are designed to collect and measure trace level radioisotopes that might be released into the atmosphere during an underground or atmospheric nuclear test. The reports on this page summarize the collection parameters and the radionuclides detected at the US samplers.
|
Submitted Mar 08, 2017 to Scientific Data AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds.
By releasing AudioSet, Google Research hopes to provide a common, realistic-scale evaluation task for audio event detection, as well as a starting point for a comprehensive vocabulary of sound events. |
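To make the "hierarchical graph of event categories" concrete, the following Python sketch walks a tiny category graph and lists everything beneath a given class. The identifiers (`c0` to `c3`), the four sample categories, and the `id` / `name` / `child_ids` field layout are assumptions made for this example, not an excerpt from the released ontology file.

```python
# Toy ontology invented for illustration; the field names are an assumed layout.
ONTOLOGY = [
    {"id": "c0", "name": "Music", "child_ids": ["c1"]},
    {"id": "c1", "name": "Musical instrument", "child_ids": ["c2", "c3"]},
    {"id": "c2", "name": "Guitar", "child_ids": []},
    {"id": "c3", "name": "Drum", "child_ids": []},
]
BY_ID = {node["id"]: node for node in ONTOLOGY}


def descendants(class_id):
    """Return names of all classes below class_id, depth-first, without repeats."""
    seen = set()
    order = []
    # Reverse so that the first child is popped (visited) first.
    stack = list(reversed(BY_ID[class_id]["child_ids"]))
    while stack:
        cid = stack.pop()
        if cid in seen:
            # A graph, unlike a strict tree, may reach a class by several paths.
            continue
        seen.add(cid)
        order.append(BY_ID[cid]["name"])
        stack.extend(reversed(BY_ID[cid]["child_ids"]))
    return order
```

The `seen` set is there because a category in a graph can sit under more than one parent; in a strict tree it would be unnecessary.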
Submitted Feb 18, 2017 to Scientific Data The Paleobiology Database (PaleoBioDB) is a non-governmental, non-profit public resource for paleontological data. It has been organized and operated by a multi-disciplinary, multi-institutional, international group of paleobiological researchers. Its purpose is to provide global, collection-based occurrence and taxonomic data for organisms of all geological ages, as well as data services to allow easy access to data for independent development of analytical tools, visualization software, and applications of all types. The Database’s broader goal is to encourage and enable data-driven collaborative efforts that address large-scale paleobiological questions.
|
Submitted Feb 17, 2017 to Scientific Data The Lightning Imaging Sensor (LIS) detects total lightning (i.e. cloud-to-cloud, cloud-to-ground, and intra-cloud flashes) from a space-based platform.
The LIS is based on digital imaging technology and built around a 128 x 128 charge-coupled device (CCD) array that is used to extract only the optical emissions of lightning through Earth's atmosphere for both day and night backgrounds. Its measurements cover the distribution and variability of total lightning, and the amount, rate, and radiant energy of total lightning during both day and night. |
Submitted Feb 15, 2017 to Scientific Data This page is an overview of the various sources of open-licensed data published by the Wikimedia Foundation or about Wikimedia projects. The information is intended to help community members, developers and researchers learn about available data sources and find the data they need for their work. Data types include wiki content data dumps, MediaWiki API, Tool Labs for connecting to shared server resources, WikiStats and analytics, DBpedia structured data, and more.
|
Submitted Feb 15, 2017 to Scientific Data Virtual Fly Brain (VFB) is an interactive tool for neurobiologists to explore the detailed neuroanatomy, neuron connectivity and gene expression of Drosophila melanogaster. Our goal is to make it easier for researchers to find relevant anatomical information and reagents.
We integrate the neuroanatomical and expression data from the published literature, as well as image datasets onto the same brain template, making it possible to run cross searches, find similar neurons and compare image data on our 3D Brain Viewer. |
Submitted Feb 11, 2017 (Edited Feb 11, 2017) to Scientific Data It has never been easier to build AI or machine learning-based systems than it is today. The ubiquity of cutting-edge open-source tools such as TensorFlow, Torch, and Spark, coupled with the availability of massive amounts of computation power through AWS, Google Cloud, or other cloud providers, means that you can train cutting-edge models from your laptop over an afternoon coffee.
This week, a few machine learning experts and I were talking about all this. To make your life easier, we’ve collected an (opinionated) list of some open datasets that you can’t afford not to know about in the AI world. |
Submitted Feb 07, 2017 to Scientific Data ScienceBase is an Open Source project that provides data cataloging, data search and discovery, web services and research community catalogs. Current documentation covers its structure, information model, services, directory and repository. The wiki provides guidance for using services to interact with the Science API, including JSON examples. Links to examples showing use of ScienceBase services are also provided.
|
Submitted Feb 01, 2017 to Scientific Data Radio Garden is an online app that allows you to find and listen to live radio stations around the world by clicking points on a 3D globe.
Radio Garden incorporates results from the international research project Transnational Radio Encounters directed by Golo Föllmer at Martin-Luther University Halle, in co-operation with the Universities of Copenhagen and Aarhus in Denmark, London Metropolitan and the University of Sunderland in the UK, and Utrecht University in the Netherlands. The project was funded by HERA (Humanities in the European Research Area) from 2013 to 2016. |