31 matching results for "nlp":
Submitted Apr 26, 2017 to Scientific Data DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link the different data sets on the Web to Wikipedia data. We hope that this work will make it easier for the huge amount of information in Wikipedia to be used in some new interesting ways. Furthermore, it might inspire new mechanisms for navigating, linking, and improving the encyclopedia itself.
|
Submitted Apr 26, 2017 to Scientific Data WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet's structure makes it a useful tool for computational linguistics and natural language processing.
WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings. However, there are some important distinctions. First, WordNet interlinks not just word forms—strings of letters—but specific senses of words. As a result, words that are found in close proximity to one another in the network are semantically disambiguated. Second, WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus does not follow any explicit pattern other than meaning similarity. |
Submitted Apr 20, 2017 to Scientific Data The SNLI corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE). We aim for it to serve both as a benchmark for evaluating representational systems for text, especially including those induced by representation learning methods, as well as a resource for developing NLP models of any kind.
|
Submitted Apr 20, 2017 (Edited Apr 20, 2017) to Scientific Data Today, we are excited to announce the first in what we plan to be a series of public dataset releases. Our dataset releases will be oriented around various problems of relevance to Quora and will give researchers in diverse areas such as machine learning, natural language processing, network science, etc. the opportunity to try their hand at some of the challenges that arise in building a scalable online knowledge-sharing platform. Our first dataset is related to the problem of identifying duplicate questions.
|
Submitted Apr 20, 2017 to Scientific Software Google Books Ngram Viewer allows one to easily graph comma-separated phrases and view their occurrence frequency.
|
Submitted Apr 17, 2017 to Scientific Data The AMR Bank is a set of English sentences paired with simple, readable semantic representations. We hope that it will spur new research in natural language understanding, generation, and translation.
The AMR Bank is manually constructed by human annotators at: - The Linguistic Data Consortium - SDL - The University of Colorado's Center for Computational Language and Education Research (CLEAR) - The University of Southern California's Information Sciences Institute (ISI) and Computational Linguistics at USC. |
Submitted Apr 16, 2017 to Science Blogs A look at the importance of Natural Language Processing by Christopher D. Manning.
|
Submitted Apr 15, 2017 (Edited Apr 16, 2017) to Science Courses and Tutorials This notebook was originally prepared for the workshop Advanced Text Analysis with SpaCy and Scikit-Learn, presented as part of NYCDH Week 2017. Here, we try out features of the SpaCy library for natural language processing. We also do some statistical analysis using the scikit-learn library.
|
Submitted Apr 07, 2017 (Edited Apr 07, 2017) to Scientific Data The Manually Annotated Sub-Corpus (MASC) consists of approximately 500,000 words of contemporary American English written and spoken data drawn from the Open American National Corpus (OANC).
All of MASC includes manually validated annotations for sentence boundaries, token, lemma and POS; noun and verb chunks; named entities (person, location, organization, date); Penn Treebank syntax; coreference; and discourse structure. Additional manually produced or validated annotations have been produced by the MASC project for portions of the sub-corpus, including full-text annotation for FrameNet frame elements and a 100K+ sentence corpus with WordNet 3.1 sense tags, of which one-tenth are also annotated for FrameNet frame elements. Annotations of all or portions of the sub-corpus for a wide variety of other linguistic phenomena have been contributed by other projects, including PropBank, TimeBank, Pittsburgh opinion, and several others. Unlike most freely available corpora including a wide variety of linguistic annotations, MASC contains a balanced selection of texts from a broad range of genres. MASC is an OPEN LANGUAGE DATA resource that can be downloaded by anyone for any purpose. At the same time, it is a COLLABORATIVE COMMUNITY RESOURCE that will ultimately be sustained by community contributions of annotations and derived data. |
Submitted Mar 27, 2017 to Science Research Groups » Computer Science CLiPS (Computational Linguistics & Psycholinguistics) is a research center associated with the Linguistics department of the faculty of Arts of the University of Antwerp, and is the result of the fusion of the CNTS and CPL research centers.
Most of the CLiPS research is based on competitively acquired research funding. Funding agencies include the Research Foundation - Flanders, the Institute for the Promotion of Innovation by Science and Technology in Flanders, the Dutch Language Union, the European Commission and occasionally companies. The goal of CLiPS is to produce internationally recognized top research and resources in (developmental) psycholinguistics, (corpus) linguistics, and computational linguistics, and to investigate the interdisciplinary combinations of these disciplines. |
Submitted Mar 27, 2017 to Scientific Software TextRank is an implementation for text summarization and keyword extraction in Python. TextRank also offers text modeling with graph and gexf exportation.
|
Submitted Mar 27, 2017 to Science Research Groups » Computer Science The Text Information Management and Analysis (TIMAN) group is part of the Database and Information Systems (DAIS) Lab of the Computer Science Department at University of Illinois at Urbana-Champaign. We work on a wide spectrum of problems in the general area of text information management and analysis , including retrieval, organization, filtering , summarization, and mining of textual information, aiming at developing advanced text information management and analysis techniques and systems that help people make better use of text information.
|
Submitted Mar 27, 2017 to Science Research Articles We present a novel graph-based summarization framework (Opinosis) that generates concise abstractive summaries of highly redundant opinions. Evaluation results on summarizing user reviews show that Opinosis summaries have better agreement with human summaries compared to the baseline extractive method. The summaries are readable, reasonably well-formed and are informative enough to convey the major opinions.
|
Submitted Mar 27, 2017 to Scientific Software MeTA is a modern C++ data sciences toolkit featuring
- text tokenization, including deep semantic features like parse trees - inverted and forward indexes with compression and various caching strategies - a collection of ranking functions for searching the indexes - topic models - classification algorithms - graph algorithms - language models - CRF implementation (POS-tagging, shallow parsing) - wrappers for liblinear and libsvm (including libsvm dataset parsers) - UTF8 support for analysis on various languages - multithreaded algorithms |
Submitted Mar 27, 2017 to Science Courses and Tutorials Mining word associations from a body of text is often one of the first Natural Language Processing techniques used when mining text data. Word associations are useful for performing NLP tasks such as part of speech tagging, parsing, entity extraction, etc. We will take a brief look at one type of word association called paradigmatic association and show how we can use the Neo4j graph database to help model our text corpus as a graph and implement a simple paradigmatic relation mining algorithm.
|
Submitted Mar 27, 2017 to Science Blogs Graphify is a Neo4j unmanaged extension that provides plug and play natural language text classification. Graphify gives you a mechanism to train natural language parsing models that extract features of a text using deep learning. When training a model to recognize the meaning of a text, you can send an article of text with a provided set of labels that describe the nature of the text. Over time the natural language parsing model in Neo4j will grow to identify those features that optimally disambiguate a text to a set of classes. This blog post explains how it works.
|
Submitted Mar 27, 2017 to Scientific Software Graphify is a Neo4j unmanaged extension used for document and text classification using graph-based hierarchical pattern recognition.
|
Submitted Mar 08, 2017 to Science Blogs The joint many-task model tackles multiple NLP tasks with a single architecture. Tasks are layered such that subsequent and previous tasks benefit from training of the closely-related tasks. Though applied to specific NLP objectives, the proposed model introduces a powerful concept for future research.
|
Submitted Mar 08, 2017 to Science Courses and Tutorials This tutorial covers the skip gram neural network architecture for Word2Vec. My intention with this tutorial was to skip over the usual introductory and abstract insights about Word2Vec, and get into more of the details. Specifically here I’m diving into the skip gram neural network model.
|
Submitted Feb 07, 2017 to Scientific Software A PHP wrapper for the Stanford Natural Language Processing library. Supports POSTagger and CRFClassifier. Loads automatically the right packages and detects the language of the given text.
|