Corpora and Datasets

  1. CIC: Catalonia Independence Corpus
  2. Heldugazte Corpus
  3. Spanish AMR Corpus
  4. QWN-PPV: Q-WordNet via Personalized PageRank
  5. ILF-WN: Intermediate Logic Forms from WordNet glosses
  6. Q-WordNet: Extracting polarity from WordNet senses

Catalonia Independence Corpus (CIC)

Two datasets in Spanish (CIC-ES) and Catalan (CIC-CA) consisting of annotated Twitter messages for automatic stance detection. The data was collected over 12 days during February and March of 2019. The corpus is annotated with three classes, AGAINST, FAVOR and NEUTRAL, which express the stance towards the target, namely the independence of Catalonia.

The distribution of the classes within the corpus is the following:

Reference:

Elena Zotova, Rodrigo Agerri, Manuel Nuñez and German Rigau (2020). Multilingual Stance Detection in Tweets: The Catalonia Independence Corpus. In LREC 2020.

Heldugazte Corpus

The Heldugazte Corpus contains 6 million tweets in Basque, obtained for the analysis of informal and formal use of the Basque language in tweets.

Heldugazte is divided into two parts, the annotated corpus and the full corpus:

Reference:

Spanish AMR Corpus

Spanish AMR annotations for 50 sentences from the Little Prince Corpus. For more details, go to the corpus site:

Reference:

QWN-PPV: Generate polarity lexicons on demand

A new version of Q-WordNet that applies Personalized PageRank to the original Q-WordNet approach. QWN-PPV (Q-WordNet by Personalized PageRanking Vector) is a simple, robust and (almost) unsupervised dictionary-based method to automatically generate polarity lexicons.
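To make the idea of propagating polarity with Personalized PageRank more concrete, here is a minimal sketch on a toy graph. It is not the QWN-PPV implementation: the graph, the seed words, the damping factor of 0.85 and the rule of comparing two rank vectors are all assumptions chosen for illustration, and the real method operates over WordNet rather than a handful of words.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration for PageRank where teleportation only lands on the seeds."""
    nodes = list(graph)
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        # Restart mass goes back to the seed nodes only.
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for node in nodes:
            neighbours = graph[node]
            if not neighbours:
                # Dangling node: hand its mass back to the seeds.
                for n in nodes:
                    new_rank[n] += damping * rank[node] * teleport[n]
                continue
            share = damping * rank[node] / len(neighbours)
            for neighbour in neighbours:
                new_rank[neighbour] += share
        rank = new_rank
    return rank


# Hypothetical toy graph of related words (a stand-in for WordNet relations).
TOY_GRAPH = {
    "good": ["great", "nice"],
    "great": ["good"],
    "nice": ["good", "okay"],
    "okay": ["nice", "poor"],
    "poor": ["okay", "bad"],
    "bad": ["poor", "awful"],
    "awful": ["bad"],
}


def build_lexicon(graph, positive_seeds, negative_seeds):
    """Label each word by whichever seed set gives it the higher rank."""
    pos = personalized_pagerank(graph, positive_seeds)
    neg = personalized_pagerank(graph, negative_seeds)
    return {w: ("positive" if pos[w] > neg[w] else "negative") for w in graph}


lexicon = build_lexicon(TOY_GRAPH, {"good"}, {"bad"})
```

In this sketch, words close to the positive seed in the graph receive more rank from it than from the negative seed, which is the intuition behind generating a lexicon from a small set of seeds without annotated corpora.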

The extrinsic evaluations performed show that qwn-ppv outperforms other automatically generated lexicons. It also shows very competitive and robust results with respect to manually annotated ones. Results suggest that no single lexicon is best for every task and dataset, and that the intrinsic evaluation of polarity lexicons is not a reliable indicator of performance on a Sentiment Analysis task.

Our method is easily applicable to create qwn-ppv(s) for other languages, and we demonstrate it by providing polarity lexicons for English and Spanish. The qwn-ppv method makes it easy to create quality polarity lexicons whenever no domain-based annotated corpora are available for a given language.

Reference:

ILF-WN: Automatic Generation of Intermediate Logic Forms for WordNet glosses

A lexical resource consisting of automatically generated Intermediate Logic Forms (ILFs) of WordNet’s glosses. ILFs include extreme neo-Davidsonian reification in a simple and flat syntax close to natural language. In its current form, the representation makes it possible to tackle semantic phenomena such as coreference and anaphora resolution. Moreover, it can be further specified to deal with other semantic issues such as quantification.

Intermediate Logic Forms are obtained directly from the output of a pipeline consisting of a part-of-speech tagger, a dependency parser and our own Intermediate Logic Form generator (all freely available tools). We apply the pipeline to the glosses of WordNet to obtain a lexical resource ready to be used as a knowledge base or common knowledge resource for a variety of tasks involving some kind of semantic inference.
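As a rough illustration of the last pipeline step, the sketch below turns already tagged and parsed tokens into a flat, reified logic form. It does not reproduce the actual ILF-WN syntax or generator: the function name, the input layout, the relation labels and the output notation are all hypothetical, and the tagger and parser output is written by hand rather than produced by real tools.

```python
def to_flat_logic_form(tokens, dependencies):
    """Build a flat conjunction of predicates from tagged tokens and dependency edges.

    tokens: list of (index, lemma, pos) tuples.
    dependencies: list of (head_index, relation, dependent_index) tuples.
    """
    variables = {}
    predicates = []
    for index, lemma, pos in tokens:
        # Reification: every token gets its own variable, events included.
        prefix = "e" if pos == "VERB" else "x"
        variables[index] = f"{prefix}{index}"
        predicates.append(f"{lemma}({variables[index]})")
    for head, relation, dependent in dependencies:
        # Each dependency becomes a binary predicate over the two variables.
        predicates.append(f"{relation}({variables[head]}, {variables[dependent]})")
    return " & ".join(predicates)


# Hand-written stand-in for tagger and parser output on "dog chase cat".
tokens = [(1, "dog", "NOUN"), (2, "chase", "VERB"), (3, "cat", "NOUN")]
dependencies = [(2, "nsubj", 1), (2, "dobj", 3)]

form = to_flat_logic_form(tokens, dependencies)
print(form)  # dog(x1) & chase(e2) & cat(x3) & nsubj(e2, x1) & dobj(e2, x3)
```

Because every word and every event has its own variable, later steps such as coreference resolution can be expressed simply as equating variables, which is one reason a flat reified form is convenient.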

Reference:

Q-WordNet: Extracting Polarity from WordNet senses

Q-WordNet is a lexical resource consisting of WordNet senses automatically classified by Positive and Negative polarity. Q-WordNet has been built for versions 1.6, 1.7, 2.0 and 3.0 of WordNet. Version 2.0 has been compared to SentiWordNet 1.0, also built from WordNet 2.0, with very promising results. The first version of Q-WordNet was released for LREC 2010 and included in the Resources Map.

A quantitative evaluation of Q-WordNet as a binary classification task shows important improvements with respect to previous approaches such as SentiWordNet.

Reference: