We used two popular fantasy novel corpora, “Harry Potter” (HP) by JK Rowling and “A Song of Ice and Fire” (ASOIF) by GRR Martin.
These datasets are based on the English-language datasets; see also: English Version of the Datasets.
Here you will find basic information about the datasets, usage of the repository, and lemmatization. An overview of the evaluation results for the analogy and word intrusion tasks can be found in RESULTS.MD.
The Russian language datasets can be found in the datasets directory.
As in the original word2vec-toolkit, the files to be evaluated are named questions*.
There are four datasets:
- datasets/questions_soiaf_analogies_rus.txt: Analogies relation test data for A Song of Ice and Fire
- datasets/questions_soiaf_doesn_match_rus.txt: Doesnt_match task test data for A Song of Ice and Fire
- datasets/questions_hp_analogies_rus.txt: Analogies relation test data for Harry Potter
- datasets/questions_hp_doesn_match_rus.txt: Doesnt_match task test data for Harry Potter
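To make the file layout concrete, here is a minimal sketch of a reader for the analogy question files. It assumes the word2vec-toolkit convention, in which a line starting with `:` opens a new section and every other non-empty line is one whitespace-separated task. The section name and the sample tasks are made up for illustration and are not taken from the datasets.

```python
def parse_questions(lines):
    """Group question lines by section (assumed word2vec-toolkit layout)."""
    sections = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(":"):
            # A ':' line opens a new section
            current = line[1:].strip()
            sections.setdefault(current, [])
        else:
            if current is None:
                # Tasks before any header go into a fallback section
                current = "unnamed"
                sections.setdefault(current, [])
            sections[current].append(line.split())
    return sections


# Hypothetical sample content, not copied from the repository
sample = [
    ": firstname-lastname",
    "гарри поттер рон уизли",
    "рон уизли гарри поттер",
]
parsed = parse_questions(sample)
print({name: len(tasks) for name, tasks in parsed.items()})
```

Because the function accepts any iterable of lines, an open file handle (opened with `encoding="utf-8"`) can be passed in directly.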
If you want to extend or modify the test data, edit the respective source files in the folder datasets: hp_analogies.txt, hp_does_not_match.txt, soiaf_analogies.txt, soiaf_does_not_match.txt.
After modifying the test data, run the following commands to re-create the datasets (the questions_* files).
cd datasets
python create_questions.py
This will generate section-based permutations to create the evaluation datasets.
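One plausible reading of "section-based permutations" is sketched below: within a section, every ordered combination of two different word pairs becomes one four-word analogy question. This is an assumption about the general idea, not a copy of the logic in create_questions.py; the function name `analogy_questions` and the sample pairs are hypothetical.

```python
from itertools import permutations


def analogy_questions(pairs):
    """Build 'a b c d' questions from every ordered pair of distinct word pairs."""
    return [
        f"{a} {b} {c} {d}"
        for (a, b), (c, d) in permutations(pairs, 2)
    ]


# Hypothetical section with three (first name, last name) pairs
section_pairs = [("гарри", "поттер"), ("рон", "уизли"), ("драко", "малфой")]
questions = analogy_questions(section_pairs)

print(len(questions))  # 3 pairs give 3 * 2 = 6 ordered combinations
for question in questions:
    print(question)
```

Under this scheme a section with n pairs yields n * (n - 1) questions, so adding a few pairs to a source file grows the evaluation set quickly.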
You can also add completely new datasets; to do so, add a corresponding line to create_questions.py.
Simply run:
$ python evaluate.py
It will:
1. create questions - generate 4 files in the ../datasets directory for evaluating models:
   - questions_{hp/soiaf}_analogies_rus.txt
   - questions_{hp/soiaf}_doesnt_match_rus.txt

   The {hp/soiaf}_analogies_rus.txt and {hp/soiaf}_doesnt_match_rus.txt files are used as input files to create the questions (task units).
2. check frequencies - create 2 files in the ../datasets directory:
   - frequencies_hp.txt
   - frequencies_soiaf.txt

   This outputs the corpus counts (frequencies) of the words (the vocabulary) in the questions files created in the previous step.
3. create models - create models in the ../models directory:
   - 5 Word2Vec models
   - 2 FastText models with different parameters

   The script ../models/create_models.py contains all settings used for model training.
4. evaluate analogies - create result files of the analogies evaluation in the ../evaluation_results directory:
   - {hp/asoif}_result_analogies.txt

   The files contain information about:
   - sections from the questions files from step 1,
   - correct/incorrect/total by section,
   - number of tasks,
   - total of the whole model (the last number of the last line for each model)
5. doesnt_match_evaluation - create result files for word intrusion in the ../evaluation_results directory:
   - {hp/asoif}_result_doesnt_match.txt

   The files contain:
   - detailed information on difficulty and accuracy,
   - number of tasks,
   - total result of the whole model (the last number in the last line for each model)
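The per-section bookkeeping of the analogy evaluation can be illustrated with a small, dependency-free sketch. It uses the plain vector-offset method (find the word closest to b - a + c) on hand-made 2-D toy vectors; the vectors, the section name, and all function names are assumptions for illustration, and the repository's own evaluation code may work differently.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Return the word closest to b - a + c, excluding the three query words."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def evaluate_sections(vectors, sections):
    """Return {section: (correct, total)} plus the overall (correct, total)."""
    per_section = {}
    for name, tasks in sections.items():
        correct = sum(solve_analogy(vectors, a, b, c) == d for a, b, c, d in tasks)
        per_section[name] = (correct, len(tasks))
    overall = (
        sum(c for c, _ in per_section.values()),
        sum(n for _, n in per_section.values()),
    )
    return per_section, overall


# Hand-made toy vectors (hypothetical, not from a trained model)
toy_vectors = {
    "мужчина": [1.0, 0.0],
    "женщина": [1.0, 1.0],
    "король": [3.0, 0.0],
    "королева": [3.0, 1.0],
    "замок": [0.0, 3.0],
}
toy_sections = {
    "gender": [
        ("мужчина", "женщина", "король", "королева"),
        ("король", "королева", "мужчина", "женщина"),
    ]
}
per_section, overall = evaluate_sections(toy_vectors, toy_sections)
print(per_section, overall)
```

The overall tuple plays the role of the "total of the whole model" figure mentioned above: correct answers summed over all sections, divided by the number of tasks.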
You can easily change the settings of the evaluation process in the src/analogies_evaluation.py and src/doesnt_match_evaluation.py files.
You can also skip steps such as 1 (create questions) and 2 (check frequencies), because there is no need to rerun these scripts if you haven't changed the dataset files.
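For readers unfamiliar with the word intrusion (doesnt_match) task, the following sketch shows one common way to pick the intruder: normalize each word vector, average them, and return the word least similar to that mean. The toy vectors and the function names are hypothetical, and this is not necessarily how src/doesnt_match_evaluation.py implements the task.

```python
import math


def unit(vector):
    """Scale a vector to length 1 (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def doesnt_match(vectors, words):
    """Return the word whose unit vector is least similar to the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = [sum(u[i] for u in units.values()) / len(units) for i in range(dim)]
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Hand-made toy vectors (hypothetical): three characters and one unrelated noun
toy_vectors = {
    "гарри": [1.0, 0.1],
    "рон": [1.0, 0.2],
    "гермиона": [0.9, 0.1],
    "озеро": [0.1, 1.0],
}
intruder = doesnt_match(toy_vectors, ["гарри", "рон", "гермиона", "озеро"])
print(intruder)
```

A task counts as solved when the returned word equals the intruder recorded in the questions file; accuracy is the share of solved tasks.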
In our experiments we found that lemmatization has a strong positive impact on evaluation results. Lemmatization raises the average corpus frequency of the terms in the dataset (the vocabulary):
Analogies:
Word intrusion:
As can be seen, corpus lemmatization helps to increase term frequencies. The low counts on the raw text are caused by the rich morphology of the Russian language, where words appear in many different forms because of Russian grammar. Example:
Russian:
Озеро - за Озером
English:
Lake - behind the Lake
Lemmatization changes all forms to their base form and in general helps to achieve better results in our setting (of small corpora).
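The effect on term frequencies can be reproduced with a toy example. The sketch below counts tokens before and after mapping inflected forms to a base form; the `LEMMAS` lookup table is a hand-written stand-in for a real Russian lemmatizer, and the token list is invented for illustration.

```python
from collections import Counter

# Toy lemma lookup (hypothetical); a real pipeline would use a Russian lemmatizer
LEMMAS = {"озером": "озеро", "озера": "озеро", "озеру": "озеро"}


def count_terms(tokens, lemmatize=False):
    """Count lowercased tokens, optionally mapping each one to its lemma first."""
    normalized = (t.lower() for t in tokens)
    if lemmatize:
        normalized = (LEMMAS.get(t, t) for t in normalized)
    return Counter(normalized)


tokens = "Озеро за озером у озера к озеру".split()
raw = count_terms(tokens)
lemmatized = count_terms(tokens, lemmatize=True)

# Raw text: each inflected form is counted separately.
# Lemmatized text: all four forms are counted under the base form.
print(raw["озеро"], lemmatized["озеро"])
```

With small corpora this matters because a term that occurs only a handful of times per inflected form may fall below the minimum-count threshold of an embedding model, while its lemma does not.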
Extensive evaluation results for the analogy and word intrusion tasks can be found in RESULTS.MD.