Since you have read Jurafsky and Martin chapter 21, you know that Named Entity Recognition (NER) is the task of finding and classifying named entities in text. This task is often treated as a sequence tagging task, like part-of-speech tagging, where words form a sequence through time and each word is given a tag. Unlike part-of-speech tagging, however, NER usually uses a relatively small number of tags, and the vast majority of words are tagged with the non-entity tag, or O tag.
Your task is to implement your own named entity recognizer. Relax: you'll find it's a lot easier than it sounds, and it should be very satisfying to accomplish. There will be two versions of this task. The first, the constrained version, is the required entity tagger that you implement using scikit-learn, filling out the stub that we give you. The second is an optional unconstrained version where you use whatever tool, technique, or feature you can get your hands on to get the best possible score on the dataset. There will be a leaderboard for each version.
As with nearly all NLP tasks, you will find that the two big points of variability in NER are (a) the features, and (b) the learning algorithm, with the features arguably being the more important of the two. The point of this assignment is for you to think about and experiment with both of these. Are there interesting features you can use? What latent signal might be important for NER? What have you learned in the class so far that can be brought to bear?
Get a head start on common NER features by looking at Figure 21.5 in the textbook.
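To make the constrained setup concrete, here is a minimal sketch of a per-token scikit-learn classifier (this is not the provided stub, whose organization and function names may differ): each token becomes a feature dictionary, a DictVectorizer turns those dictionaries into a sparse matrix, and a logistic regression model predicts the BIO tag. It assumes each sentence is a list of (word, POS, BIO tag) triples, as the NLTK reader described below provides.

```python
# Minimal sketch of a token-level NER classifier with scikit-learn.
# Assumes sentences are lists of (word, POS, BIO tag) triples.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def token_features(sent, i):
    """Feature dictionary for the i-th token of a sentence."""
    word = sent[i][0]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "prev.lower": sent[i - 1][0].lower() if i > 0 else "<S>",
        "next.lower": sent[i + 1][0].lower() if i < len(sent) - 1 else "</S>",
    }

def train(train_sents):
    """Fit a vectorizer and classifier on per-token features and BIO tags."""
    X = [token_features(s, i) for s in train_sents for i in range(len(s))]
    y = [s[i][2] for s in train_sents for i in range(len(s))]
    vec = DictVectorizer()
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(X), y)
    return vec, clf
```

At prediction time you would build the same kind of feature dictionaries, run them through vec.transform, and write out the predicted tag for each token.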
Here are the materials that you should download for this assignment:
- Code stub.
- conlleval.py: eval script
The Data
The data we use comes from the Conference on Natural Language Learning (CoNLL) 2002 shared task of named entity recognition for Spanish and Dutch. The introductory paper to the shared task will be of immense help to you, and you should definitely read it. You may also find the original shared task page helpful. We will use the Spanish corpus (although you are welcome to try out Dutch too).
The tagset is:
- PER: for Person
- LOC: for Location
- ORG: for Organization
- MISC: for Miscellaneous named entities
The data uses BIO encoding (called IOB in the textbook), which means that each named entity tag is prefixed with B- (beginning) or I- (inside). So, for a multiword entity like James Earle Jones, the first token James is tagged B-PER, and each subsequent token is tagged I-PER. The O tag is for non-entities.
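Concretely, that fragment would appear as one token per line with its tag (the O-tagged word here is just for illustration):

```
James B-PER
Earle I-PER
Jones I-PER
spoke O
```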
We strongly recommend that you study the training and dev data (no one's going to stop you from examining the test data, but for the integrity of your model, it's best not to look at it). Are there idiosyncrasies in the data? Are there patterns you can exploit with cool features? Are there obvious signals that identify names? For example, in some Turkish writing, there is a tradition of putting an apostrophe between a named entity and the morphology attached to it, so a feature like isApostrophePresent() goes a long way. Of course, in English and several other languages, capitalization is a hugely important feature. In some African languages, there are certain words that always precede city names.
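As a concrete illustration, feature helpers along these lines are easy to write (the names here are made up, not part of the stub):

```python
# Illustrative feature helpers; names are hypothetical, not part of the stub.
def is_capitalized(word):
    """Capitalization is a strong entity cue in Spanish, as in English."""
    return word[:1].isupper()

def has_apostrophe(word):
    """The Turkish-style apostrophe cue mentioned above."""
    return "'" in word

def word_shape(word):
    """Collapse characters into a coarse shape, e.g. 'EFECOM' -> 'XXXXXX'."""
    return "".join(
        "X" if c.isupper() else "x" if c.islower() else "d" if c.isdigit() else c
        for c in word
    )
```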
The data is packaged nicely in NLTK. Get installation instructions here: installing NLTK.
You will be glad to hear that the data is a mercifully small download. See the NLTK data page for download options, but one way to get the conll2002 data is:
$ python -m nltk.downloader conll2002
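Once the corpus is downloaded, it can be read through NLTK's corpus reader; a minimal sketch of loading the Spanish splits:

```python
# Load the Spanish CoNLL 2002 splits through NLTK's corpus reader.
from nltk.corpus import conll2002

train_sents = list(conll2002.iob_sents('esp.train'))
dev_sents = list(conll2002.iob_sents('esp.testa'))
test_sents = list(conll2002.iob_sents('esp.testb'))

# Each sentence is a list of (word, POS tag, BIO entity tag) triples.
print(train_sents[0][:3])
```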
Evaluation
There are two common ways of evaluating NER systems: phrase-based, and token-based. In phrase-based, the more common of the two, a system must predict the entire span correctly for each name. For example, say we have text containing James Earle Jones, and our system predicts [PER James Earle] Jones. Phrase-based gives no credit for this because it missed Jones, whereas token-based would give partial credit for correctly identifying James and Earle as B-PER and I-PER respectively. We will use phrase-based to report scores.
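To see why the partially correct prediction above earns no phrase-based credit, it helps to think of BIO tag sequences as labeled spans. Here is a rough sketch (not the official conlleval logic) of turning tags into (label, start, end) spans:

```python
# Rough sketch: convert a BIO tag sequence into (label, start, end_exclusive) spans.
def bio_to_spans(tags):
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes any trailing span
        if tag.startswith("B-") or tag == "O" or (tag.startswith("I-") and tag[2:] != label):
            if label is not None:
                spans.append((label, start, i))
            start, label = (i, tag[2:]) if tag != "O" else (None, None)
        # an I- tag with the same label just extends the open span
    return spans

gold = ["B-PER", "I-PER", "I-PER"]  # James Earle Jones
pred = ["B-PER", "I-PER", "O"]      # [PER James Earle] Jones
print(bio_to_spans(gold))  # [('PER', 0, 3)]
print(bio_to_spans(pred))  # [('PER', 0, 2)] -- no phrase-based credit
```

Phrase-based F1 then counts a predicted span as correct only if a span with the same label and boundaries appears in the gold annotation.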
The output of your code must be word gold pred (one token per line), as in:
La B-LOC B-LOC
Coruña I-LOC I-LOC
, O O
23 O O
may O O
( O O
EFECOM B-ORG B-ORG
) O O
. O O
Here's how to get scores (assuming the above format is in a file called results.txt):
# Phrase-based score
$ python conlleval.py results.txt
(The Python version of conlleval doesn't calculate the token-based score, but if you really want it, you can use the original Perl version with the -r flag.)
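For reference, writing the results file is just a matter of emitting one token per line; a minimal sketch, assuming you hold parallel lists of sentences and predicted tag sequences (the variable and function names here are illustrative):

```python
# Illustrative sketch: write predictions in the "word gold pred" format for conlleval.py.
def write_results(path, sents, pred_tags_per_sent):
    with open(path, "w", encoding="utf-8") as out:
        for sent, pred_tags in zip(sents, pred_tags_per_sent):
            for (word, pos, gold), pred in zip(sent, pred_tags):
                out.write(f"{word} {gold} {pred}\n")
            out.write("\n")  # blank line between sentences, per CoNLL convention
```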
Other resources
Here are some other NER frameworks which you are welcome to run in the unconstrained version:
- CogComp NER, one of the best taggers
- LSTM-CRF, a recent neural network tagger
- Stanford NER, Stanford's tried-and-true tagger
- spaCy
- Brown clustering software. You might find it useful.
- Spanish text and vectors
- Europarl corpora (look for the English-Spanish parallel text)
Note: you are not allowed to use pre-trained NER models even in the unconstrained version. Please train your own. You are allowed to use pre-trained embeddings.
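If you do use pre-trained embeddings, one simple way (among many) to fold them into a feature-based tagger is to expose each embedding dimension as a real-valued feature. A sketch using gensim, with a placeholder vector file name:

```python
# Sketch: embedding dimensions as real-valued features (file name is a placeholder).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("spanish_vectors.txt", binary=False)

def embedding_features(word):
    feats = {}
    if word.lower() in vectors:
        for j, value in enumerate(vectors[word.lower()]):
            feats[f"emb_{j}"] = float(value)
    return feats
```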
Baselines
The version we have given you gets about 49% F1 right out of the box. We made some very simple modifications, and got it to 60%. This is a generous baseline that any thoughtful model should be able to beat. The state of the art on the Spanish dataset is about 85%. If you manage to beat that, then look for conference deadlines and start writing, because you can publish it.
As always, beating the baseline alone will earn you a B on the project. In order to earn an A, demonstrate that you have thought about the problem carefully and come up with solutions beyond what was strictly required. There is extra credit for the top of the leaderboard, etc.
Deliverables
Here are the deliverables that you will need to submit:
- Code, as always in Python 3.
- Constrained results (in a file called constrained_results.txt)
- Optional unconstrained results (in a file called unconstrained_results.txt)
- PDF report (called writeup.pdf)
Recommended readings
- Jurafsky and Martin chapter 21
- Design Challenges and Misconceptions in Named Entity Recognition, a very highly cited NER paper from a Penn professor
- Entity Extraction is a Boring Solved Problem or is it?
- Neural Architectures for Named Entity Recognition, a popular recent paper on, well, just read the title.
- Introductory paper to CoNLL 2002 shared task