===============================
Datasets for Entity Recognition
===============================

This repository contains datasets from several domains annotated with a variety of entity types, useful for entity recognition and named entity recognition (NER) tasks.
NOTE: I am no longer actively adding datasets to this list -- there are likely more NER datasets that have appeared since 2020. However, I am happy to add more datasets via issues or pull requests.

Datasets for NER in English
===========================

.. |check| unicode:: 0x2714

The following table lists datasets for English-language entity recognition (for NER datasets in other languages, see below). The data directory contains information on where to obtain those datasets which could not be shared due to licensing restrictions, as well as code to convert them (if necessary) to the CoNLL 2003 format.


============== =============== ======================= =============================== ==================================
Dataset        Domain          License                 Reference                       Availability
============== =============== ======================= =============================== ==================================
CoNLL 2003     News            DUA                     Sang and Meulder, 2003          `Easy <https://github.com/patverga/torch-ner-nlp-from-scratch/tree/master/data/conll2003/>`_ `to <https://github.com/synalp/NER/tree/master/corpus/CoNLL-2003>`_ `find <https://github.com/glample/tagger/tree/master/dataset>`_
NIST-IEER      News            None                    NIST 1999 IE-ER                 `NLTK data <https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/ieer.zip>`_
MUC-6          News            LDC                     Grishman and Sundheim, 1996     `LDC 2003T13 <https://catalog.ldc.upenn.edu/LDC2003T13>`_
OntoNotes 5    Various         LDC                     Weischedel et al., 2013         `LDC 2013T19 <https://catalog.ldc.upenn.edu/LDC2013T19>`_
BBN            Various         LDC                     Weischedel and Brunstein, 2005  `LDC 2005T33 <https://catalog.ldc.upenn.edu/LDC2005T33>`_
GMB-1.0.0      Various         None                    Bos et al., 2017                `http://gmb.let.rug.nl/data.php <http://gmb.let.rug.nl/releases/gmb-1.0.0.zip>`_
GUM-3.1.0      Wiki            Several (*2)            Zeldes, 2016                    |check| Included here
wikigold       Wikipedia       CC-BY 4.0               Balasuriya et al., 2009         |check| Included here
Ritter         Twitter         None                    Ritter et al., 2011             `No split <https://github.com/aritter/twitter_nlp/blob/master/data/annotated/ner.txt>`_, `Train/test/dev split <https://github.com/aritter/twitter_nlp/tree/master/data/annotated/wnut16/data>`_
BTC            Twitter         CC-BY 4.0               Derczynski et al., 2016         |check| Included here
WNUT17         Social media    CC-BY 4.0               Derczynski et al., 2017         |check| Included here
i2b2-2006      Medical         DUA                     Uzuner et al., 2007             `http://www.i2b2.org <https://www.i2b2.org/NLP/DataSets/Main.php>`_
i2b2-2014      Medical         DUA                     Stubbs et al., 2015             `http://www.i2b2.org <https://www.i2b2.org/NLP/DataSets/Main.php>`_
CADEC          Medical         CSIRO                   Karimi et al., 2015             http://data.csiro.au/
AnEM           Anatomical      CC-BY-SA 3.0            Ohta et al., 2012               |check| Included here
MITRestaurant  Queries         None                    Liu et al., 2013a               `http://groups.csail.mit.edu/sls/ <https://groups.csail.mit.edu/sls/downloads/restaurant/>`_
MITMovie       Queries         None                    Liu et al., 2013b               `http://groups.csail.mit.edu/sls/ <https://groups.csail.mit.edu/sls/downloads/movie/>`_
MalwareTextDB  Malware         None                    Lim et al., 2017                `http://www.statnlp.org/ <http://www.statnlp.org/research/re/MalwareTextDB-1.0.zip>`_
re3d           Defense         Several (*1)            DSTL, 2017                      |check| Included here
SEC-filings    Finance         CC-BY 3.0               Alvarado et al., 2015           |check| Included here
Assembly       Robotics        X                       Costa et al., 2017              X
WikiNEuRal     Wikipedia       CC BY-SA-NC 4.0         Tedeschi et al., 2021           https://github.com/Babelscape/wikineural
MultiNERD      Wikipedia       CC BY-SA-NC 4.0         Tedeschi et al., 2022           https://github.com/Babelscape/multinerd
HIPE-2022      Historical      CC BY-SA-NC 4.0         Ehrmann et al., 2022            https://github.com/hipe-eval/HIPE-2022-data
Music-NER      Music           MIT                     Epure and Hennequin, 2023       https://github.com/deezer/music-ner-eacl2023
WIESP2022-NER  Astrophysics    CC BY-SA-NC 4.0         Grezes et al., 2022             https://huggingface.co/datasets/adsabs/WIESP2022-NER
NNE            News            CC 4.0 / LDC            Ringland et al., 2019           https://github.com/nickyringland/nested_named_entities
WorldWide      News            CC BY-SA-NC 4.0         Shan et al., 2023               https://github.com/stanfordnlp/en-worldwide-newswire https://arxiv.org/abs/2404.13465
============== =============== ======================= =============================== ==================================

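
For datasets converted to the CoNLL 2003 format mentioned above, a reader along the following lines may be a useful starting point. This is only a minimal sketch, not the conversion code shipped in this repository: the function name ``read_conll`` and the toy sample are hypothetical, and it assumes whitespace-separated columns with the token in the first column and the entity tag in the last, blank lines between sentences, and optional ``-DOCSTART-`` lines marking document boundaries.

.. code-block:: python

    from typing import Iterable, Iterator, List, Tuple

    Sentence = List[Tuple[str, str]]


    def read_conll(lines: Iterable[str]) -> Iterator[Sentence]:
        """Yield sentences as lists of (token, tag) pairs."""
        sentence: Sentence = []
        for raw in lines:
            line = raw.strip()
            # Blank lines end a sentence; -DOCSTART- lines mark document boundaries.
            if not line or line.startswith("-DOCSTART-"):
                if sentence:
                    yield sentence
                    sentence = []
                continue
            columns = line.split()
            # Assumption: token in the first column, entity tag in the last.
            sentence.append((columns[0], columns[-1]))
        # Emit the final sentence if the input does not end with a blank line.
        if sentence:
            yield sentence


    # Toy sample for illustration only (not taken from any dataset listed here).
    sample = """-DOCSTART- -X- -X- O

    Alice NNP B-NP B-PER
    visited VBD B-VP O
    Paris NNP B-NP B-LOC

    Bob NNP B-NP B-PER
    stayed VBD B-VP O
    """

    sentences = list(read_conll(sample.splitlines()))

The same function can be applied to an open file handle, since it only iterates over lines. Files that use a different column order or a different tagging scheme would need the column indices adjusted.
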

Licenses
========

Notes on licenses:
(1) re3d ("Relationship and Entity Extraction Evaluation Dataset") contains several datasets, with different licenses. These are:
- CC-BY-SA 3.0 (Wikipedia dataset)
- CC BY-NC 3.0 (BBC_Online dataset)
- CC BY 3.0 AU (Australian_Department_of_Foreign_Affairs dataset)
- public domain (US_State_Department dataset, CENTCOM dataset)
- UK Open Government Licence v3.0 (UK_Government dataset)
- Delegation_of_the_European_Union_to_Syria: see https://eeas.europa.eu/delegations/syria/8157/legal-notice_en
(2) GUM 3.1.0 comprises three datasets, with licenses CC-BY 3.0, CC-BY-SA 3.0 and CC-BY-NC-SA 3.0. The annotations are licensed under CC-BY 4.0.
More detailed license information for each dataset can be found in the corresponding subdirectory.

Later ...
=========

- Tabassum et al., Code and Named Entity Recognition in StackOverflow https://cocoxu.github.io/publications/ACL2020_stackoverflow_NER.pdf
- LitBank: https://github.com/dbamman/litbank (Bamman, Popat and Shen, An Annotated Dataset of Literary Entities, NAACL 2019)
- NNE: A Dataset for Nested Named Entity Recognition in English Newswire, 2019 https://github.com/nickyringland/nested_named_entities
- Mars Target Encyclopedia - LPSC abstracts labeled data set: https://zenodo.org/record/1048419#.W5a2CBwnZhE
- Best Buy queries: https://www.kaggle.com/dataturks/best-buy-ecommerce-ner-dataset/home
- Resume entities for NER: https://www.kaggle.com/dataturks/resume-entities-for-ner/home
- FEW-NERD: A Few-shot Named Entity Recognition Dataset https://aclanthology.org/2021.acl-long.248/

Datasets for NER in other languages
===================================


Lexical Named Entity resources
------------------------------

- HeiNER: http://heiner.cl.uni-heidelberg.de/index.shtml
- NECKAr: https://event.ifi.uni-heidelberg.de/?page_id=532#Wikidata_NE_dataset

Code-Switching
--------------

- English-Spanish tweets (CALCS 2018): https://code-switching.github.io/2018/ ; https://code-switching.github.io/2018/files/spa-eng/Release.zip ; http://www.aclweb.org/anthology/W18-3219
- Arabic-Egyptian tweets (CALCS 2018): https://code-switching.github.io/2018/ ; https://code-switching.github.io/2018/files/msa-egy/ArabicTweetsTokenAssigner.zip ; http://www.aclweb.org/anthology/W18-3219
- Hindi-English social media text: https://github.com/SilentFlame/Named-Entity-Recognition ; http://aclweb.org/anthology/W18-2405
- EMNLP 2014 Shared Task - Code-Switched Tweets (Nepali-English, Spanish-English, Mandarin-English, Arabic-Arabic dialects): http://emnlp2014.org/workshops/CodeSwitch/call.html

German
------

- CoNLL 2003 (English, German): https://www.clips.uantwerpen.be/conll2003/ner/
- GermEval 2014: https://sites.google.com/site/germeval2014ner/data
- Tübingen Treebank of Written German (TüBa-D/Z): http://www.sfs.uni-tuebingen.de/en/ascl/resources/corpora/tueba-dz.html
- Europeana Newspapers (Dutch, French, German): https://github.com/EuropeanaNewspapers/ner-corpora ; http://lab.kb.nl/dataset/europeana-newspapers-ner#access
- German EUROPARL transcripts (subset): https://nlpado.de/~sebastian/software/ner_german.shtml
- Named Entity Model for German, Politics (NEMGP): https://www.thomas-zastrow.de/nlp/
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNEuRal: https://github.com/Babelscape/wikineural
- MultiNERD: https://github.com/Babelscape/multinerd
- DFKI SmartData Corpus (geo-entities): https://dfki-lt-re-group.bitbucket.io/smartdata-corpus/ (A German Corpus for Fine-Grained Named Entity Recognition and Relation Extraction of Traffic and Industry Events. Martin Schiersch, Veselina Mironova, Maximilian Schmitt, Philippe Thomas, Aleksandra Gabryszak, Leonhard Hennig. Proceedings of LREC, 2018)
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
- Elena Leitner, Georg Rehm, Julián Moreno-Schneider, A Dataset of German Legal Documents for Named Entity Recognition, LREC 2020: http://georg-re.hm/pdf/LREC-2020-Leitner-et-al-preprint.pdf ; Data: https://github.com/elenanereiss/Legal-Entity-Recognition
- HIPE-2022, named entity recognition and entity linking in multilingual historical documents: https://hipe-eval.github.io/HIPE-2022/ https://github.com/hipe-eval/HIPE-2022-data

Dutch
-----

- CoNLL 2002 (Spanish, Dutch): https://www.clips.uantwerpen.be/conll2002/ner/
- Europeana Newspapers (Dutch, French, German): https://github.com/EuropeanaNewspapers/ner-corpora ; http://lab.kb.nl/dataset/europeana-newspapers-ner#access
- MEANTIME Corpus (Parallel corpus: English, Spanish, Italian, Dutch): http://www.newsreader-project.eu/results/data/wikinews/
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNEuRal: https://github.com/Babelscape/wikineural
- MultiNERD: https://github.com/Babelscape/multinerd
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- Dutch parliamentary documents 2015-2016, from 1848.nl (Jonkers, Named Entity Recognition on Dutch Parliamentary Documents using Frog, thesis, University of Amsterdam, 2016): https://github.com/Poezedoez/NER/blob/master/Code/data/lobby/golden_standard
- SONAR 1 - Desmet and Hoste, Fine-grained Dutch named entity recognition, 2014 (hierarchy of classes)
- Corpus-SONAR books and Corpus Gutenberg Dutch: http://blog.namescape.nl/?page_id=85 ; http://portal.clarin.nl/node/1940

Afrikaans
---------

- NCHLT Afrikaans Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/299

Spanish
-------

- CoNLL 2002 (Spanish, Dutch): https://www.clips.uantwerpen.be/conll2002/ner/
- AnCora (Spanish, Catalan): http://clic.ub.edu/corpus/en
- DEFT Spanish Treebank (LDC2018T01): https://catalog.ldc.upenn.edu/LDC2018T01
- PANACEA (LAB): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-lab-es
- PANACEA (ENV): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-env-es
- MEANTIME Corpus (Parallel corpus: English, Spanish, Italian, Dutch): http://www.newsreader-project.eu/results/data/wikinews/
- ACE 2007 (Spanish and Arabic): https://catalog.ldc.upenn.edu/LDC2014T18
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNEuRal: https://github.com/Babelscape/wikineural
- MultiNERD: https://github.com/Babelscape/multinerd
- http://www.grupolys.org/~marcos/pub/lrec16.tar.bz2 (used in "Incorporating Lexico-semantic Heuristics into Coreference Resolution Sieves for Named Entity Recognition at Document-level")
- Multilingual corpora with coreferential annotation of person entities (Spanish, Galician, Portuguese): http://gramatica.usc.es/~marcos/lrec.tar.bz2
- DrugSemantics Gold Standard (Moreno et al., DrugSemantics: A corpus for Named Entity Recognition in Spanish Summaries of Product Characteristics, 2017): https://data.mendeley.com/datasets/fwc7jrc5jr/1
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
- CANTEMIST (CANcer TExt Mining Shared Task – tumor named entity recognition) - named entity recognition of a critical type of concept related to cancer, namely tumor morphology in Spanish medical texts: https://temu.bsc.es/cantemist/

Catalan
-------

- AnCora (Spanish, Catalan): http://clic.ub.edu/corpus/en

Galician
--------

- Galician NER corpus: https://gramatica.usc.es/~marcos/resources/corpus_gal_nec.txt.gz
- Multilingual corpora with coreferential annotation of person entities (Spanish, Galician, Portuguese): http://gramatica.usc.es/~marcos/lrec.tar.bz2

Basque
------

- Basque Named Entities Corpus (EIEC): http://ixa.eus/node/4486?language=en
- Basque Disambiguated Named Entities Corpus (EDIEC): http://ixa.si.ehu.es/node/4485?language=en
- Egunkaria 2000 corpus (383 newswire texts), mentioned in http://qtleap.eu/wp-content/uploads/2014/04/QTLEAP-2013-D5.1.pdf

Portuguese
----------

- HAREM: https://www.linguateca.pt/aval_conjunta/HAREM/harem_ing.html
- CINTIL corpus: http://cintil.ul.pt/cintilfeatures.html#corpus
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNEuRal: https://github.com/Babelscape/wikineural
- MultiNERD: https://github.com/Babelscape/multinerd
- Multilingual corpora with coreferential annotation of person entities (Spanish, Galician, Portuguese): http://gramatica.usc.es/~marcos/lrec.tar.bz2
- Bosque 8.0 EAGLES format: