Entity Disambiguation for the new ACL Anthology
Written by Gabriel Orlanski
The following are required:
- GROBID PDF parser
  - I used GROBID together with the Python client written by the GROBID authors
  - You can use any PDF parser, but its output must be XML files. Please check config.json for the XPaths you need
- PyYAML 5.1.2
- Unidecode 1.1.1
- fuzzysearch 0.6.2
- hurry.filesize 0.9
- lxml 4.4.0
- multiprocessing 2.6.2.1
- nltk 3.4.4
- numpy 1.17.0
- py-stringmatching 0.4.1
- scikit-learn 0.21.3
- scipy 1.3.1
- textdistance 4.1.4
- tqdm 4.33.0
- ujson 1.35
To use the program, follow these steps:
- Run GROBID and its Python client on the PDFs
- Run create_data.py to generate the information about the papers, organizations, and manual fixes needed
- Train the model (you can skip this step if you want to use the pre-trained models)
  - Run preprocess_data.py
  - Run train.py
- Create the targets you want to disambiguate (NOT IMPLEMENTED YET)
- Run disambiguate.py (NOT IMPLEMENTED YET)
- If you would like to test the disambiguation program, run evaluate-disambiguation.py
- Check the results.json file and edit the 'same' keys to reflect any changes you want to make
- Run update_papers.py to update the papers with their corrected authors (NOT IMPLEMENTED YET)
You can use your own model if you would like, but there are a few requirements to do so:
- You must have .predict() and .predict_proba() functions that each take a 2d array of vectors with shape [n, m]
  - n is the number of samples to predict
  - m is the length of each vector
- .predict() must return a np.array() of 1s and 0s, where 1 means the same author and 0 means different authors
- .predict_proba() must return a np.array() of length-2 arrays, where the first element is the probability that the pair are different authors and the second is the probability that the pair is the same author
- For the time being, your model must also have a .voting attribute set to either 'soft' or 'hard'
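The requirements above can be sketched as a small class. This is only an illustration of the interface, not the model this project ships: the name ThresholdModel, the threshold parameter, and the rule of averaging each feature vector are all hypothetical and stand in for whatever your own model does.

```python
import numpy as np


class ThresholdModel:
    """Toy pairwise model exposing .predict(), .predict_proba() and .voting.

    Hypothetical scoring rule: the mean of each feature vector is treated
    as the probability that the pair is the same author.
    """

    def __init__(self, threshold=0.5, voting="soft"):
        if voting not in ("soft", "hard"):
            raise ValueError("voting must be 'soft' or 'hard'")
        self.threshold = threshold
        self.voting = voting

    def predict_proba(self, X):
        # X has shape [n, m]: n samples, each a vector of length m.
        X = np.asarray(X, dtype=float)
        if X.ndim != 2:
            raise ValueError("expected a 2d array of shape [n, m]")
        p_same = np.clip(X.mean(axis=1), 0.0, 1.0)
        # Column 0: probability of different authors, column 1: same author.
        return np.column_stack([1.0 - p_same, p_same])

    def predict(self, X):
        # 1 means the same author, 0 means different authors.
        p_same = self.predict_proba(X)[:, 1]
        return (p_same >= self.threshold).astype(int)


model = ThresholdModel()
pairs = np.array([[0.9, 0.8, 1.0], [0.1, 0.2, 0.0]])  # shape [2, 3]
labels = model.predict(pairs)
probs = model.predict_proba(pairs)
```

Any object with these two methods and the .voting attribute should fit the description above, so a wrapper around an existing classifier can follow the same shape.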
You can also use your own CompareAuthors. Please take a look at the compare_authors class for more information on what you need. If you would like to pass specific information to it, take a look at getAuthorInfo()* in create_training_data.py and change it accordingly.
* I will try to make it easier to override this function by allowing it to be passed to create_training_data