NIdent

NIdent-EN and NIdent-CA are two corpora, one English and one Catalan, annotated with near-identity tags. NIdent-EN contains 49,279 tokens and originates from the NP4E corpus (Hasler et al., 2006), which is drawn from the Reuters Agency. In NIdent-EN, near-coreferent mentions represent 12% of all coreferent mentions. NIdent-CA comes from the AnCora-CA corpus (Recasens and Martí, 2010) and contains 51,622 tokens. AnCora-CA comprises newspaper and newswire articles from the El Periódico newspaper and the ACN news agency. In NIdent-CA, near-coreferent mentions represent 16% of all coreferent mentions.

The near-coreference annotation was obtained implicitly, based on the idea that annotators would disagree when labelling a near-identity relation if the only two options they were given were “coreference” and “non-coreference”. Five annotators were asked to annotate the same Catalan and English corpora in parallel with coreference and non-coreference relations. Afterwards, we relabelled as “near-identity” the relations that were annotated as coreferent by some but not all of the annotators. For a more detailed description of the merging algorithm and the NIdent corpora, we refer the reader to our paper presented at LREC 2012 (Recasens et al., 2012). Please cite this paper if you use our data.
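The relabelling rule described above can be illustrated with a minimal Python sketch. This is not the merging algorithm from the paper, only the basic idea of turning partial agreement into a near-identity label. The function name `merge_labels`, the relation identifiers, and the input layout (one boolean vote per annotator for each candidate relation) are assumptions made for illustration; the actual corpus format and the handling of mention chains may differ.

```python
from typing import Dict, List


def merge_labels(votes: Dict[str, List[bool]]) -> Dict[str, str]:
    """Merge per-annotator coreference votes into a single label.

    votes[relation_id] holds one boolean per annotator: True if that
    annotator marked the relation as coreferent, False otherwise.
    (Hypothetical input layout, not the NIdent file format.)
    """
    merged = {}
    for relation_id, marks in votes.items():
        if not marks:
            raise ValueError(f"no annotator votes for {relation_id!r}")
        if all(marks):
            # Every annotator chose "coreference".
            merged[relation_id] = "coreference"
        elif any(marks):
            # Some but not all annotators chose "coreference".
            merged[relation_id] = "near-identity"
        else:
            # No annotator chose "coreference".
            merged[relation_id] = "non-coreference"
    return merged


# Hypothetical votes from five annotators on three candidate relations.
example_votes = {
    "rel-1": [True, True, True, True, True],
    "rel-2": [True, False, True, True, False],
    "rel-3": [False, False, False, False, False],
}
labels = merge_labels(example_votes)
```

In this sketch, only "rel-2" receives the near-identity label, because it is the only relation on which the five hypothetical annotators disagree.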

Laura Hasler, Constantin Orasan, and Karin Naumann. 2006. "NPs for Events: Experiments in coreference annotation". In Proceedings of LREC 2006, pages 1167–1172.

Marta Recasens and M. Antònia Martí. 2010. "AnCora-CO: Coreferentially annotated corpora for Spanish and Catalan". In Language Resources and Evaluation, 44(4):315–345.

Marta Recasens, M. Antònia Martí, and Constantin Orasan. 2012. "Annotating Near-Identity from Coreference Disagreements". In Proceedings of LREC 2012, pages 165–172.