Paraphrase corpora are collections of paraphrases, that is, linguistic expressions that differ in wording but have (approximately) the same meaning.
- P4P
P4P stands for “Paraphrase for Plagiarism”. The P4P corpus contains a partition of the plagiarism cases in the PAN-PC-10 corpus [1] manually annotated with the paraphrase phenomena they contain. It is composed of 847 source-plagiarism pairs in English.
For further reading, refer to the README.txt file in the corresponding download package, [2], and [3].
Paraphrase type annotation guidelines can be found here.
[1] M. Potthast, B. Stein, A. Barrón-Cedeño, and P. Rosso. 2010. An evaluation framework for plagiarism detection. In Proceedings of COLING 2010: Posters, pages 997-1005.
[2] M. Vila, M. A. Martí, and H. Rodríguez. 2011. Paraphrase concept and typology. A linguistically based and computationally oriented approach. Procesamiento del Lenguaje Natural 46:83-90.
[3] A. Barrón-Cedeño, M. Vila, M. A. Martí, and P. Rosso. 2013. Plagiarism meets paraphrasing: Insights for the next generation in automatic plagiarism detection. Computational Linguistics, 39(4):917-947.
This research work is carried out in the framework of the following projects and grants:
- TEXT-Knowledge 2.0. TIN2009-13391-C04-04
- Text-Enterprise 2.0. TIN2009-13391-C04-03
- VLC/Campus Microcluster on Multimodal Interaction in Intelligent Systems
- EC WIQ-EI IRSES project (grant no. 269180). FP 7 Marie Curie People Framework
- FPU AP2008-02185
- CONACyT-Mexico 192021
MSRP-A stands for “Microsoft Research Paraphrase” corpus “Annotated”. The MSRP-A corpus contains the positive examples in the MSRP corpus [1] manually annotated with the paraphrase phenomena they contain. It is composed of 3,900 paraphrase pairs in English.
For further reading on the corpus, refer to the README.txt file in the corresponding download package.
Paraphrase type annotation guidelines can be found here.
References
[1] W. B. Dolan and C. Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP 2005), pages 9-16.
This research work is carried out in the framework of the following projects and grants:
- TEXT-Knowledge 2.0. TIN2009-13391-C04-04
- KNOW2. TIN2009-14715-C04-04
- FPU AP2008-02185
WRPA stands for “Relational Paraphrase Acquisition from Wikipedia” corpus. The WRPA corpus contains relational paraphrases extracted by the WRPA system from Wikipedia [1]. WRPA contains several sub-corpora:
WRPA-person is composed of 362 paraphrases expressing the person-date_of_birth relation, 449 paraphrases expressing the person-date_of_death relation, and 965 paraphrases expressing the person-place_of_birth relation.
WRPA-person-2 is composed of 55 paraphrases expressing the person-alternate_name relation, 40 paraphrases for person-charge, 54 for person-child, 238 for person-residence, 233 for person-employee_of, 375 for person-member_of, 555 for person-origin, 40 for person-parent, 62 for person-religion, 94 for person-school_attended, 413 for person-spouse, and 532 for person-title.
WRPA-authorship is composed of 81,101 pairs of paraphrases expressing the authorship relation.
WRPA-authorship-A is composed of 1,000 paraphrase pairs from WRPA-authorship manually annotated with the paraphrase phenomena they contain.
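The per-relation figures listed for WRPA-person and WRPA-person-2 can be tallied with a few lines of Python. This is only an illustrative sketch: the counts are copied from the descriptions on this page, while the dictionary layout and the function names (`total`, `largest_relation`) are assumptions made for the example and do not reflect the file format of the download packages.

```python
# Paraphrase counts per relation, copied from the sub-corpus descriptions.
# The data layout and names here are hypothetical, chosen for illustration;
# they are not the format used in the WRPA download packages.
WRPA_PERSON = {
    "person-date_of_birth": 362,
    "person-date_of_death": 449,
    "person-place_of_birth": 965,
}

WRPA_PERSON_2 = {
    "person-alternate_name": 55,
    "person-charge": 40,
    "person-child": 54,
    "person-residence": 238,
    "person-employee_of": 233,
    "person-member_of": 375,
    "person-origin": 555,
    "person-parent": 40,
    "person-religion": 62,
    "person-school_attended": 94,
    "person-spouse": 413,
    "person-title": 532,
}


def total(counts):
    """Return the total number of paraphrases in a sub-corpus."""
    return sum(counts.values())


def largest_relation(counts):
    """Return the relation with the most paraphrases in a sub-corpus."""
    return max(counts, key=counts.get)


if __name__ == "__main__":
    for name, counts in (("WRPA-person", WRPA_PERSON),
                         ("WRPA-person-2", WRPA_PERSON_2)):
        print(f"{name}: {total(counts)} paraphrases, "
              f"largest relation: {largest_relation(counts)}")
```

Under these figures, WRPA-person adds up to 1,776 paraphrases and WRPA-person-2 to 2,691, with person-place_of_birth and person-origin as the largest relations respectively.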
For further reading on the corpus, refer to the README.txt file in the corresponding download package and [1].
Paraphrase type annotation guidelines can be found here.
[1] M. Vila, M. A. Martí, and H. Rodríguez. Relational paraphrase acquisition from Wikipedia. The WRPA method and corpus (submitted).
This research work is carried out in the framework of the following projects and grants:
- TEXT-Knowledge 2.0. TIN2009-13391-C04-04
- KNOW2. TIN2009-14715-C04-04
- FPU AP2008-02185
ETPC stands for “Extended Typology Paraphrase Corpus”. The Extended Paraphrase Typology (with positive and negative examples) can be found here.