Article

EssayGAN: Essay Data Augmentation Based on Generative Adversarial Networks for Automated Essay Scoring

Yo-Han Park, Yong-Seok Choi, Cheon-Young Park and Kong-Joo Lee
1 Department of Radio and Information Communications Engineering, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Korea
2 Interaction AI Core, KT Institute of Convergence Technology, Seoul 06763, Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(12), 5803; https://doi.org/10.3390/app12125803
Submission received: 16 May 2022 / Revised: 1 June 2022 / Accepted: 3 June 2022 / Published: 7 June 2022
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)

Abstract

In large-scale testing and e-learning environments, automated essay scoring (AES) can relieve the burden upon human raters by replacing human grading with machine grading. However, building AES systems based on deep learning requires a training dataset consisting of essays that are manually rated with scores. In this study, we introduce EssayGAN, an automatic essay generator based on generative adversarial networks (GANs). To generate essays rated with scores, EssayGAN has multiple generators for each score range and one discriminator. Each generator is dedicated to a specific score and can generate an essay rated with that score. Therefore, the generators can focus only on generating a realistic-looking essay that can fool the discriminator without considering the target score. Although ordinary-text GANs generate text on a word basis, EssayGAN generates essays on a sentence basis. Therefore, EssayGAN can compose not only a long essay by predicting a sentence instead of a word at each step, but can also compose a score-rated essay by adopting multiple generators dedicated to the target score. Since EssayGAN can generate score-rated essays, the generated essays can be used in the supervised learning process for AES systems. Experimental results show that data augmentation using augmented essays helps to improve the performance of AES systems. We conclude that EssayGAN can generate essays not only consisting of multiple sentences but also maintaining coherence between sentences in essays.

1. Introduction

An essay-writing test is an important way to assess students’ logical thinking, critical reasoning, and basic writing skills [1]. Students write an essay about a given prompt, and then human raters manually grade the essay according to a rubric. However, scoring by human raters requires an enormous amount of time and effort. Moreover, it is difficult to maintain consistency throughout the grading process. Therefore, automated essay scoring (AES) can relieve human raters of this burden and provide students with more opportunities to enhance their writing skills.
AES aims to automatically rate an essay by analyzing and identifying the main features required for rating it. Traditional AES systems utilize linguistic features crafted manually by human experts. However, selecting and implementing these features requires a great deal of time and effort from human experts. Recent advances in deep learning research have had a major impact on all natural language applications, including AES. The difficulties of feature selection can, in most cases, be mitigated via a deep learning approach. Thus, AES systems based on a deep learning architecture can achieve better performance than traditional systems without encountering the troubles associated with feature selection.
Human-rated essays are indispensable for training an AES system based on a deep learning architecture. However, the high cost of collecting human-rated essays can be a bottleneck in building a cutting-edge scoring system. Automatic data augmentation can be a solution to the chronic problem of a lack of training data. In this study, we propose an automatic essay generator based on generative adversarial networks (GANs) [2] to augment training data automatically for AES systems.
Recently, GANs have shown successful results in text generation. Conventional GANs consist of two sub-networks: a generator that produces fake data and a discriminator that differentiates real from fake data. The core idea of GANs is to play a min–max game between the discriminator and the generator, i.e., adversarial training. The goal of the generator is to generate data that the discriminator believes to be real.
In this study, we introduce an automatic essay generator based on GAN, which we call EssayGAN. To generate essays rated with different scores, EssayGAN has multiple generators and a discriminator. The number of generators in EssayGAN is determined depending on the score range used for grading the essays. As there are multiple generators in EssayGAN, each generator is dedicated to producing only essays with a specific score. Along with this, the discriminator is trained to distinguish between real and generated essays.
In general, text generation by GAN consists of predicting the next token from a set of pre-defined tokens to create the most authentic text. In the same vein, we consider essay generation as the sequence of prediction of the next sentence based on previously chosen sentences. An ordinary-text GAN model predicts the next word in every step, whereas EssayGAN predicts the next sentence.
There are two reasons why EssayGAN samples a sentence rather than a token. One is that GAN has difficulty in generating long text. Even a cutting-edge GAN model cannot generate a well-organized essay with a length of 150–650 words. When generating an essay on a sentence basis, EssayGAN can create a longer essay. The other reason is that we need to generate an essay with a given target score. EssayGAN can easily compose an essay corresponding to a specific target score by sampling sentences from the essays rated with the target score.
EssayGAN can produce as many essays as needed for a given score. Therefore, it can provide a training dataset large enough to train AES systems based on deep neural networks. Experimental results show that data augmentation by EssayGAN helps to improve the performance of AES systems.
The contributions of this study are as follows:
1. We propose EssayGAN, which can automatically augment essays rated with a given score.
2. We introduce a text-generation model that performs sentence-based prediction to generate long essays.
3. We show that essay data augmentation can improve AES performance.
The rest of the paper is organized as follows. We first explore related studies in Section 2, and we present EssayGAN in Section 3. We then present the experimental results in Section 4, followed by the conclusions in Section 5.

2. Related Studies

2.1. CS-GAN and SentiGAN

There are many variations of GANs. One line of research is based on integrating additional category information into GANs. There are two major approaches to handle category information in GANs. Auxiliary classifier GAN (ACGAN) [3] is one of the most popular architectures for categorical data generation. ACGAN deploys an additional classifier layer into a discriminator. A generator is trained to minimize losses calculated by the discriminator and the classifier in ACGAN. Another approach to categorical data generation is to adopt multiple generators, with each generator being dedicated to each category. SentiGAN [4] is a representative GAN with multiple generators.
In text generation, category sentence GAN (CS-GAN) [5], which is an ACGAN, incorporates category information into the GAN to generate synthetic sentences with categories. A generator of CS-GAN exploits long short-term memory (LSTM) networks, which are one of the most common methods used for sentence generation. Category information is added at each generation step. As with ACGAN, the training of the generator is led by the discriminator and the classifier in CS-GAN. A policy-gradient method of reinforcement learning is used to update the parameters of the generator in CS-GAN. In [5], the CS-GAN model generated sentences using various categories of information, with 2 to 13 classes for several datasets. Experimental results have shown that generated sentences with categories can perform well in supervised learning, especially in multi-category datasets.
SentiGAN [4] consists of multiple generators and one multi-class discriminator. The number of generators is determined by the number of categories. SentiGAN is designed to generate text with different sentiment labels. Owing to the multiple generators dedicated to a specific sentiment, SentiGAN can focus on accurately generating its own text with a specific sentiment label. SentiGAN’s discriminator is a multi-class classifier with k classes and an additional fake class when there are k categories. This architecture allows the generators to focus on generating text with specific sentiment labels and avoid generating unacceptable sentiment text.

2.2. Automated Essay Scoring Based on Pre-Trained Models

The advent of pre-trained models such as BERT [6] has resulted in a huge leap forward in the development of natural language applications. The current trend in natural language processing (NLP) research is to employ a large-scale pre-trained language model as an initial model and then fine-tune it with target training data. A few studies of AES have also started using pre-trained models and have shown successful results.
Pre-trained models such as BERT and XLNet [7] were first adopted as the backbone of AES systems in [8]. The output embedding of the special token ‘[CLS]’ in the BERT model can be considered as the representation of the whole input sequence. An additional linear layer, acting as a scoring function, takes this embedding as input and produces a score as output. After the entire model, including the BERT model and the additional linear layer, is fine-tuned on AES tasks, it can grade an input essay. XLNet is a pre-trained model that diminishes the discrepancy between pre-training and fine-tuning by eliminating mask tokens from the pre-training data. Instead, it generalizes the autoregressive pre-training method, which enables the learning of bidirectional contexts by permuting input sequences.
A new method called multi-loss was proposed in [9] to fine-tune BERT models with AES tasks. The AES system has the same architecture as the systems mentioned in [8], except that the final layer outputs two results: a regression score and a rank. Therefore, all weights in the AES system are updated with a combination of regression and ranking losses during a fine-tuning process. As training progresses, the significance of the regression loss increases, whereas that of the ranking loss decreases. The multi-loss objective has proven to be an effective approach to build AES systems based on the BERT model.
The AES system proposed in [10] shows that a simple ensemble of pre-trained models, even with a reduction in size, can achieve significant results in AES tasks. The pre-trained models adopted in [10] are ALBERT [11], Reformer [12], ELECTRA [13], and MobileBERT [14]. All of these models can be used on small devices with power-efficient computing.
An adapter module for pre-trained models was introduced in [15]. In general, an entire model needs to be fine-tuned over a certain number of epochs on a target task, no matter how good a pre-trained model is. Therefore, the fine-tuning process requires at least several thousand parameter updates, even for the smallest pre-trained model. Instead of fine-tuning the entire model, ref. [15] employed an adapter module, freezing most of the model and updating only a few thousand parameters to achieve excellent performance. The adapter module leverages the massive knowledge of pre-trained models to achieve high performance with relatively little fine-tuning.

3. EssayGAN

Typically, GANs that generate text sample a token from a set of pre-defined tokens to compose a new sentence. In the same way, EssayGAN samples a sentence from a sentence collection to compose a new essay.
Since EssayGAN composes a new essay by taking sentences as they are, the newly generated essay consists of the same sentences as those in the essays of the training dataset. However, the newly generated essay differs from those of the training dataset in that its sentences come from multiple sources and appear in a different order. We assume that an essay composed by sampling sentences from several source essays can keep a minimal level of coherence, as long as the source essays share the same prompt.
The overall architecture of EssayGAN is shown in Figure 1. The code is available at https://github.com/YoHanPark512/EssayGAN, accessed on 7 June 2022.
Suppose that essays are rated on r score ranges; in this case, we use r generators and one discriminator. The value of r is determined by the number of score ranges. Because EssayGAN assigns one generator to each score range, every generator can be trained to compose only essays corresponding to its assigned score. The discriminator is trained to distinguish whether an input essay is real or fake. SentiGAN [4], which also uses multiple generators, shows that a framework with a mixture of generators lets each generator produce its own type of text more accurately. Because the multiple generators help one another, the quality of the text generated by each individual generator can be greatly improved.
The i-th generator G_i is parameterized by θ_i and denoted G_{θ_i}, and the discriminator D_ϕ is parameterized by ϕ. The goal of the i-th generator G_{θ_i} is to generate essays that can be evaluated with the score c_i. Each generator G_{θ_i} generates fake essays that aim to deceive the discriminator D_ϕ while it discriminates between real and fake essays.
Since the generators are trained only on a set of essays rated with a corresponding score, they can generate essays with the given score. Hence, the discriminator does not have to evaluate whether a generated essay is appropriate for the score, but can instead focus on identifying whether the essay is real or fake.
We adopted a reinforcement learning method to train the generators. The output scores of the discriminator are given to the generators as reward values. The generators and the discriminator are trained alternately until their parameters reach a stable state.
The following sections describe in detail the components in Figure 1, including sentence representation.

3.1. Sentence Representation

Since the generators of EssayGAN take sentences as their input, every sentence should be represented as a unique embedding vector.
To represent all sentences in the training essay data, we adopted language-agnostic bidirectional encoder representations from transformers (BERT) sentence embedding (LaBSE) [16], which produces language-agnostic cross-lingual sentence embeddings for 109 languages.
LaBSE is a pre-trained model built on a BERT-like architecture and uses the masked language model (MLM) and the translation language model (TLM). It is then fine-tuned using a translation ranking task. The resulting model can provide multilingual sentence embeddings in a single model.
In this study, we calculated the sentence embeddings of all sentences in the training data, using LaBSE in advance, and then saved those embeddings in a sentence-embedding table. After that, the embeddings were exploited by the discriminator and the generators of EssayGAN, which are described in the following sections.
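To make the pre-computation step concrete, the following is a minimal sketch of how such a sentence-embedding lookup table could be built with the publicly available LaBSE checkpoint in the sentence-transformers library. The model identifier and helper function are illustrative assumptions rather than the authors' exact code.

```python
# Sketch: pre-compute a LaBSE embedding for every unique training sentence.
# Assumes the sentence-transformers release of LaBSE; the authors' pipeline may differ.
from sentence_transformers import SentenceTransformer
import numpy as np

def build_sentence_table(sentences):
    """Return a (sentence -> id) mapping and an (N, 768) embedding table."""
    model = SentenceTransformer("sentence-transformers/LaBSE")
    unique = sorted(set(sentences))
    embeddings = model.encode(unique, batch_size=64, show_progress_bar=True)
    sent2id = {s: i for i, s in enumerate(unique)}
    return sent2id, np.asarray(embeddings, dtype=np.float32)
```

The table is computed once and then shared by the discriminator and all generators, so no gradient flows back into LaBSE during adversarial training.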

3.2. Discriminator

The goal of the discriminator is to distinguish between human-written and generator-composed essays.
The discriminator is built based on bi-directional LSTM networks, as shown in Figure 2. The i-th sentence, s_i, in an input essay is converted into an embedding vector, E_{s_i}, by looking up the sentence-embedding table described in Section 3.1.
Sentence embeddings are fed into the LSTM hidden states, and the first and the last hidden states are concatenated into a representation of the essay. The final layer of the discriminator outputs a scalar value indicating the likelihood that an input essay is real. The discriminator is trained to output a value as close as possible to 1 for real essays and as small as possible for fake essays. The output value of the discriminator is provided to the generators as a reward value.
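A minimal PyTorch sketch of such a discriminator is given below, assuming the layer sizes listed in Table 1 (768-dimensional sentence embeddings, 1536 hidden units, dropout of 0.2). Class and variable names are illustrative, not taken from the released code.

```python
import torch
import torch.nn as nn

class EssayDiscriminator(nn.Module):
    """Bi-directional LSTM over sentence embeddings -> probability that an essay is real."""
    def __init__(self, embed_dim=768, hidden_dim=1536, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        # The first and last hidden states (both directions) are concatenated.
        self.out = nn.Linear(4 * hidden_dim, 1)

    def forward(self, sent_embs):                 # (batch, n_sentences, embed_dim)
        states, _ = self.lstm(sent_embs)          # (batch, n_sentences, 2 * hidden_dim)
        essay_repr = torch.cat([states[:, 0, :], states[:, -1, :]], dim=-1)
        score = torch.sigmoid(self.out(self.dropout(essay_repr)))
        return score.squeeze(-1)                  # value in (0, 1), used as the reward
```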

3.3. Generator and Reinforcement Learning

Figure 3 depicts the architecture of the i-th generator, which is assigned to generate essays scored as c_i. We trained r generators to generate essays with specified scores. The value of r is determined by the range of scores specified in the scoring rubric.
We utilized LSTM networks as the basic architecture of the generators. The LSTM networks were initially pre-trained as a sentence-level language model on the training dataset using a conventional maximum likelihood estimation (MLE) method. Therefore, the pre-trained LSTM networks can predict the most likely next sentence based on previously selected sentences. After the pre-training phase, adversarial training was employed to train the generators and the discriminator in turn.
The output layer of each LSTM cell has the same dimensionality as a sentence-level one-hot vector that identifies a specific sentence. Each LSTM hidden state h_t can be defined recursively according to Equation (1), and the predicted next sentence of the LSTM cell can be defined as in Equation (2):

h_t = LSTM(h_{t-1}, E_{s_t}), (1)

p(\hat{s}_{t+1} | S_0, s_1, ..., s_t) = softmax(V h_t + b), (2)

where E_{s_t} is the embedding vector of the t-th sentence s_t and S_0 is the start state. V is a weight matrix and b is a bias. The next sentence is chosen through random sampling based on this probability distribution. A new essay is organized according to the sequence of sentences generated by the LSTM.
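A sketch of one generation step, consistent with Equations (1) and (2), could look as follows: an LSTM cell consumes the embedding of the previously selected sentence, and a linear-plus-softmax layer over the fixed sentence vocabulary of the target-score essays gives the sampling distribution. The class and method names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SentenceGenerator(nn.Module):
    """LSTM that samples the index of the next sentence from a fixed sentence vocabulary."""
    def __init__(self, sent_table, embed_dim=768, hidden_dim=1536):
        super().__init__()
        # sent_table: (vocab_size, embed_dim) tensor of LaBSE embeddings of candidate sentences
        self.register_buffer("sent_table", sent_table)
        self.lstm_cell = nn.LSTMCell(embed_dim, hidden_dim)
        self.proj = nn.Linear(hidden_dim, sent_table.size(0))   # V h_t + b in Equation (2)

    def step(self, prev_sent_ids, state=None):
        emb = self.sent_table[prev_sent_ids]                    # E_{s_t}
        h, c = self.lstm_cell(emb, state)                       # Equation (1)
        probs = torch.softmax(self.proj(h), dim=-1)             # Equation (2)
        next_ids = torch.multinomial(probs, 1).squeeze(-1)      # random sampling of the next sentence
        return next_ids, probs, (h, c)
```

Repeatedly calling step and collecting the sampled indices yields a complete essay of the desired length for the generator's score range.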
There is an obstacle involved in applying adversarial training to essay generation. The discriminator can only provide a reward value for a completed essay, whereas the generators require a reward value for incomplete essays at every sampling step. Therefore, to reward the generators at every sampling step, we applied a Monte Carlo search [17] to estimate the next unknown sentence to complete an essay.
The parameter θ_i of the i-th generator G_i was updated by means of the REINFORCE algorithm [18], using rewards provided by the discriminator D_ϕ. The objective of the generator G_i is to maximize the expected reward in Equation (3), in which R_n is the reward for a completed essay of length n and Q_{D_ϕ}^{G_{θ_i}}(s, a) is the action-value function of a sequence, i.e., the expected cumulative reward starting from state s, taking action a, and following the policy G_{θ_i}:

J(θ_i) = E[R_n | S_0; θ_i] = \sum_{t=1}^{n} G_{θ_i}(s_t | S_{1:t-1}) · Q_{D_ϕ}^{G_{θ_i}}(S_{1:t-1}, s_t), (3)

where S_0 is the start state, S_{1:t-1} is the sequence of previously generated sentences (the current state), and s_t is the currently selected sentence.

The action-value function Q_{D_ϕ}^{G_{θ_i}}(s, a) is estimated as the output value of the discriminator, as defined in Equation (4), where D_ϕ(S_{1:n}) is the probability that an essay consisting of the n sentences (s_1, s_2, ..., s_{n-1}, s_n) is real:

Q_{D_ϕ}^{G_{θ_i}}(S_{1:n-1}, s_n) = D_ϕ(S_{1:n}), (4)

where S_{1:n-1} is the state specifying the generated sentences (s_1, s_2, ..., s_{n-1}), whereas s_n is the action of selecting the n-th sentence.

To evaluate the action-value for an intermediate state, we applied a Monte Carlo search with a roll-out policy G_β to sample the remaining n - t unknown sentences. In this work, G_{β_i} has the same parameters as the generator G_{θ_i}. Hence, the action-value function can be calculated using Equation (5):

Q_{D_ϕ}^{G_{θ_i}}(S_{1:t-1}, s_t) = (1/K) \sum_{k=1}^{K} Q_{D_ϕ}^{G_{θ_i}}(S^k_{1:n-1}, s^k_n), with S^k_{1:n} ∈ MC^{G_{β_i}}(S_{1:t}; K), for t < n,
Q_{D_ϕ}^{G_{θ_i}}(S_{1:t-1}, s_t) = Q_{D_ϕ}^{G_{θ_i}}(S_{1:n-1}, s_n), for t = n, (5)

where MC^{G_{β_i}}(S_{1:t}; K) = {S^1_{1:n}, ..., S^K_{1:n}} is the result of a K-time Monte Carlo search.
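The reward estimation of Equation (5) can be sketched as follows: for an unfinished essay, K roll-outs are completed with the roll-out policy and the discriminator scores are averaged; for a finished essay, the discriminator score is used directly. The complete() helper of the roll-out policy is a hypothetical name introduced only for this illustration.

```python
import torch

def action_value(partial_ids, rollout_policy, discriminator, sent_table, essay_len, K=16):
    """Estimate Q(S_{1:t-1}, s_t) as in Equation (5) via a K-time Monte Carlo search."""
    t = partial_ids.size(1)
    if t == essay_len:                               # completed essay: query the discriminator directly
        return discriminator(sent_table[partial_ids])
    rewards = []
    for _ in range(K):                               # K roll-outs with the roll-out policy G_beta
        full_ids = rollout_policy.complete(partial_ids, essay_len)   # hypothetical helper
        rewards.append(discriminator(sent_table[full_ids]))
    return torch.stack(rewards, dim=0).mean(dim=0)   # average discriminator reward
```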

4. Experimental Results

4.1. Training EssayGAN and Dataset

Algorithm 1 provides a full description of how to perform the adversarial training of EssayGAN. To avoid high discrimination results due to poor generation, the training iteration ratio between the generators (g-steps in Algorithm 1) and the discriminator (d-steps in Algorithm 1) was kept at 1:5.
In addition, the generators are prone to forget what they learned from the sentence-level language model, which guides basic coherence in writing. Therefore, an MLE-based teacher for the generators intervenes in the training phase of EssayGAN, as shown in lines 7–11 (t-steps) of Algorithm 1. In this study, we kept the training iteration ratio between t-steps, g-steps, and d-steps at 1:2:10.
The hyperparameters for training EssayGAN are described in Table 1.
In this study, we used the Automated Student Assessment Prize (ASAP) dataset, which is the de facto standard dataset in the AES domain for training and evaluation. The ASAP dataset contains 12,978 essays on eight different prompts (essay topics) that were written by students from grades 7–10. A detailed description of the dataset is given in Table 2.
Algorithm 1: Training EssayGAN for essay data augmentation
Input: generators {G_{θ_i}}_{i=1}^{r}, policy models {G_{β_i}}_{i=1}^{r}, discriminator D_ϕ, essay dataset T = {T_1, T_2, ..., T_r} assorted by r score ranges
Output: well-trained generators {G_{θ_i}}_{i=1}^{r}
1: Initialize the parameters of {G_{θ_i}}_{i=1}^{r} and D_ϕ with random values
2: Pre-train each G_{θ_i} on T_i using MLE
3: Generate fake essays F = {F_1, F_2, ..., F_r} using {G_{θ_i}}_{i=1}^{r} to train D_ϕ
4: Pre-train D_ϕ on {T ∪ F} by minimizing the cross-entropy loss
5: β_i ← θ_i for each i in 1:r
6: repeat
7:   for t-steps do
8:     for i in 1:r do
9:       Train G_{β_i} on T_i using MLE
10:    end for
11:  end for
12:  for g-steps do
13:    for i in 1:r do
14:      Generate a fake essay Ŝ_{1:n} = (ŝ_1, ..., ŝ_n) ~ G_{θ_i}
15:      for t in 1:n do
16:        Compute Q(s = Ŝ_{1:t-1}; a = ŝ_t) by Equation (5)
17:      end for
18:      Update the parameters θ_i via policy gradient
19:    end for
20:  end for
21:  for d-steps do
22:    Generate fake essays F = {F_1, F_2, ..., F_r} using {G_{θ_i}}_{i=1}^{r}
23:    Train D_ϕ on {T ∪ F} using the cross-entropy loss
24:  end for
25:  β_i ← θ_i for each i in 1:r
26: until EssayGAN convergence
The number of generators in EssayGAN depends on the score range. However, for prompts 1, 2, 7, and 8, which have broader score ranges, EssayGAN would require too many generators, leaving too little training data for each one. Therefore, we limited the number of generators to five. For instance, the 0–60 score range of prompt 8 was discretized into five partitions using a simple partitioning model. We obtain a normalized score range by using the partitioning model described in detail in Appendix A.
The normalized score ranges for each prompt are presented in the parentheses of the ‘Score range’ column of Table 2. In summary, EssayGAN adopted four generators for prompts 2, 3, and 4 and five generators for prompts 1, 5, 6, 7, and 8.

4.2. Characteristics of Augmented Essays

We set two baseline models of data-augmentation for comparisons with EssayGAN. Random is a data-augmentation model that generates a new essay by composing randomly selected sentences from essays with the same target score.
All sentences in each essay of the training dataset were assumed to be sequentially numbered. RandomOrder generates a new essay by collecting sentences in non-decreasing index order from essays with the same target score, so that a minimal level of coherence between sentences is preserved.
First of all, we observed the characteristics and statistics of the augmented essays, which are summarized in Table 3 and Table 4. We introduced the ‘number of source essays’ and the ‘number of sentences in reversed order’ to examine the consistency and coherence of the augmented essays according to the augmentation techniques used. We augmented the same numbers of essays as those in the training data for each prompt.
Figure 4 depicts an example of the number of source essays and the number of sentences in reversed order to help readers fully understand these concepts. The number of source essays refers to the number of essays from which the sentences of a newly generated essay are extracted. We can reason that little content consistency remains in a newly generated essay when the number of source essays is too large. The number of sentences in reversed order refers to how often a sentence with a lower index follows one with a higher index in a newly generated essay. This measurement indirectly indicates how well a newly generated essay is organized to maintain coherence.
In the example shown in Figure 4, Generator-3 was trained to compose new essays with a score of 3. In the essay database, each sentence s_ij of a source essay denotes the j-th sentence of the i-th essay. The augmented essay has three different source essays (1, 2, and 3) and one sentence pair in reversed order, between s_26 and s_14.
Table 3 reports the number of source essays according to the augmentation technique. Random and RandomOrder used almost as many source essays as the average number of sentences per essay because they collect sentences randomly from the source essays. EssayGAN used fewer than three source essays for prompts 3–7, whose average numbers of sentences range from 4.63 to 11.64. EssayGAN trained without the t-step (in Algorithm 1) used more source essays than EssayGAN for all prompts.
A comparison of the number of sentences in reversed order is shown in Table 4. Essays generated from RandomOrder did not have a reversed sentence order because they were composed under the strategy of monotonically increasing the sentence order. Random had an average of 5.73 sentences in reversed order in an augmented essay, whereas EssayGAN trained with and without the t-step had 0.51 and 1.01 on average, respectively. We can thus reason that EssayGAN can compose a new essay more coherently than the other augmentation techniques that were considered in this study.
Next, we used a more explicit metric to examine the coherence of the augmented essays. Coherence is a property of well-organized texts that makes them easy to read and understand [19]. Semantic coherence, proposed in [19], quantifies the degree of semantic relatedness within a text. In this work, we adopted this metric for coherence measurements. The local coherence, as expressed in Equation (6), measures the similarity between consecutive sentences in a text T with n sentences, whereas the global coherence, as expressed in Equation (7), measures the similarity between every pair of sentences.
coherence_L(T) = \sum_{i=1}^{n-1} sim(s_i, s_{i+1}) / (n - 1), (6)

coherence_G(T) = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} sim(s_i, s_j) / \sum_{i=1}^{n-1} i. (7)
In this study, we evaluated the semantic similarity sim(s_i, s_j) between sentences s_i and s_j by means of the cosine similarity between their embedding vectors encoded by LaBSE [16].
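Under these definitions, the two coherence scores can be computed directly from the LaBSE embeddings. The sketch below follows the sentence-transformers API and is an assumption about the exact encoder call rather than the authors' implementation; it expects an essay with at least two sentences.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

_encoder = SentenceTransformer("sentence-transformers/LaBSE")

def coherence(sentences):
    """Return (local, global) coherence of a list of sentences, per Equations (6) and (7)."""
    E = _encoder.encode(sentences)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)        # unit vectors -> dot product = cosine
    sim = E @ E.T
    n = len(sentences)
    local = sum(sim[i, i + 1] for i in range(n - 1)) / (n - 1)
    pair_sims = [sim[i, j] for i in range(n - 1) for j in range(i + 1, n)]
    global_ = sum(pair_sims) / sum(range(1, n))             # denominator: sum_{i=1}^{n-1} i
    return float(local), float(global_)
```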
Table 5 shows comparisons of the local and global coherence between sentences according to the augmentation techniques. ‘Training Data + n swapped’ is a real essay in which n sentences were replaced with n random sentences from different essays. The coherence of essays augmented by the Random and RandomOrder techniques was too low for the essays to read as organized text. In terms of coherence, EssayGAN without the t-step performed slightly better than EssayGAN. To sum up, the local coherence of essays augmented by EssayGAN was close to that of the training data with one swapped sentence, and the global coherence was close to that of the training data with two swapped sentences. We conclude that EssayGAN can generate essays whose coherence is approximately the same as that of human-written essays in which one or two sentences have been replaced with sentences from other essays.

4.3. Experimental Results

The main objective of the following experiments was to show whether generated essays were useful as training data for AES systems. Since building an AES system was not the main concern of this study, we employed a simple deep learning architecture for AES systems. The BERT model with an additional layer for producing essay scores was the AES system used in these experiments. Since EssayGAN can automatically generate essays, our AES system was fine-tuned using both real and generated essays.
In the experiments, we adopted the quadratic weighted kappa (QWK) value [20], which is widely used as a de facto standard metric. It measures the agreement in scoring between human raters and AES systems and varies from 0 (random agreement) to 1 (complete agreement). It is defined in Equation (8).
κ = 1 - \sum_{i,j} W_{i,j} O_{i,j} / \sum_{i,j} W_{i,j} E_{i,j}, (8)

where O_{i,j} of matrix O is the number of essays that receive a rating i from a human rater and a rating j from the AES system. The matrix E is the outer product of the vectors of human ratings and system ratings. The weight entries of matrix W are the differences between rater scores, calculated as W_{i,j} = (i - j)^2 / (N - 1)^2, where N is the number of possible ratings and i and j are the human rating and the system rating, respectively.
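Because Equation (8) coincides with Cohen's kappa using quadratic weights, the metric can be reproduced with scikit-learn; the snippet below is a standard reference implementation for checking scores, not the evaluation code used in the paper.

```python
from sklearn.metrics import cohen_kappa_score

def qwk(human_scores, system_scores, min_rating, max_rating):
    """Quadratic weighted kappa between human and system ratings, Equation (8)."""
    labels = list(range(min_rating, max_rating + 1))
    return cohen_kappa_score(human_scores, system_scores, labels=labels, weights="quadratic")

# Perfect agreement yields 1.0; disagreements lower the value toward 0.
print(qwk([1, 2, 3, 4], [1, 2, 3, 4], min_rating=0, max_rating=4))   # 1.0
print(qwk([1, 2, 3, 4], [1, 2, 2, 3], min_rating=0, max_rating=4))   # < 1.0
```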
To evaluate our approach, we carried out a fivefold cross validation of the ASAP dataset in the same way as in [21,22].
Table 6 summarizes the comparison of QWK correlation according to the data augmentation techniques. The baseline is the QWK correlation of the model trained exclusively on the training data. The simple data-augmentation methods, Random and RandomOrder, were not useful for the AES system. The performances of those methods were lower than that of the baseline.
From the perspective of data augmentation, EssayGAN performed better than Random and RandomOrder. In addition, EssayGAN outperformed EssayGAN without the t-step. This indicates that adopting the t-step of Algorithm 1 helps EssayGAN mitigate the catastrophic forgetting problem that is pervasive during the fine-tuning process.
CS-GAN [5] is designed to generate new sentences tagged with category information. It has three sub-networks: a generator, a discriminator, and a classifier. To compare EssayGAN with another GAN model, we adapted CS-GAN, replacing its classifier with a scorer so that it generates new essays rated with score information. The performance of the AES system trained on data augmented by CS-GAN was worse than that of the baseline model. In this work, data augmentation by CS-GAN was applied only to the source-dependent prompts 3–6, which have small score ranges. (Data augmentation by CS-GAN for prompts 1, 2, 7, and 8 did not work properly, so we did not include those results. These prompts require CS-GAN to generate argumentative or narrative essays of 250–650 words, which seemed to be too difficult for CS-GAN with only one generator.)
Table 7 shows a performance comparison with state-of-the-art AES systems. Our AES system has the same architecture as ‘BERT-CLS’ from [8]; however, it can be trained with more training data thanks to EssayGAN. The R²BERT model achieved the best performance among the comparative models by adopting a multi-loss learning strategy. Our AES system achieved the second-best performance on average. Based on the experimental results shown in Table 7, we found that the more training data were carefully augmented by EssayGAN, the better the AES system performed, and this performance could be improved up to the level of the state-of-the-art AES models.
In Table A1 of Appendix B, we present examples of the automatically generated essays for prompt 3.

5. Conclusions

In this study, we introduced EssayGAN, which can automatically augment essays rated with specific scores. It consists of multiple generators and a discriminator. EssayGAN can generate as many essays as necessary by sampling sentences from the training set of essays rated with target scores. To the best of our knowledge, EssayGAN is the first attempt to automatically augment text data on a sentence basis. In addition, EssayGAN can maintain coherence between sentences in an augmented essay.
We performed several experiments on the AES task to verify the usefulness of EssayGAN. The experimental results proved that EssayGAN is a reliable data augmentation tool for supervised learning. Therefore, EssayGAN can alleviate the problem of a lack of training data, especially when complex AES systems based on deep learning networks are required. Furthermore, EssayGAN, using multiple generators, can augment essays with a higher quality than that of a conventional GAN using a single generator. A simple AES system, even compared with the state-of-the-art AES models, can yield promising results if it can be trained on more augmented data.
Our future work will include increasing the coherence levels between sentences in an augmented essay and applying EssayGAN to various applications that require augmented data consisting of multiple sentences.

Author Contributions

Conceptualization, C.-Y.P. and K.-J.L.; methodology, Y.-H.P. and Y.-S.C.; software, Y.-H.P.; validation, Y.-H.P.; writing—original draft preparation, K.-J.L. and Y.-H.P.; writing—review and editing, K.-J.L. and Y.-H.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

This work was supported by the research fund of Chungnam National University.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AES: Automated Essay Scoring
ASAP: Automated Student Assessment Prize
GANs: Generative Adversarial Networks
LSTM: Long Short-Term Memory
MLE: Maximum Likelihood Estimation
LaBSE: Language-agnostic BERT Sentence Embedding
MC: Monte Carlo
QWK: Quadratic Weighted Kappa

Appendix A. Partition of Score Range

In order to perform the partitioning of the score ranges, we adopted an entropy-based discretization [23] method:
H_S(v) = - \sum_{i=1}^{n} p(v_i) · log p(v_i), (A1)

where n is the number of partitions and p(v_i) is the probability of the i-th partition. When only this entropy is considered, the selected partitioning can leave a few partitions that are far too wide. Therefore, we penalized overly wide partitions so that they do not take up most of the score range. Equation (A2) is the entropy of the partition S in terms of the partition widths:

H_S(w) = - \sum_{i=1}^{n} p(w_i) · log p(w_i), (A2)

where p(w_i) is the proportion of the width of the i-th partition.

The final partitioning model is described in Equation (A3), which is a weighted average of Equations (A1) and (A2). In this experiment, we set α to 0.35. The number of generators for prompts 1, 2, 7, and 8 is decided according to the partition S that maximizes Equation (A3):

Partition_S = max_S ( α · H_S(v) + (1 - α) · H_S(w) ). (A3)
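A small sketch of this selection procedure is given below, assuming the weighted-average form with the (1 - α) coefficient. Candidate partitions are represented by their per-partition essay counts and widths; all function names are illustrative.

```python
import numpy as np

def entropy(values):
    """Shannon entropy of the distribution obtained by normalizing raw values."""
    p = np.asarray(values, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def partition_score(counts, widths, alpha=0.35):
    """alpha * H_S(v) + (1 - alpha) * H_S(w), as in Equation (A3)."""
    return alpha * entropy(counts) + (1 - alpha) * entropy(widths)

def best_partition(candidates, alpha=0.35):
    """candidates: list of (counts, widths) pairs, one per candidate partition S."""
    scores = [partition_score(c, w, alpha) for c, w in candidates]
    return int(np.argmax(scores))
```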

Appendix B. Examples of Automatically Generated Essays

Table A1. Examples of augmented essays corresponding to Prompt 3. (The different colors of the generated essays indicate the different sources of sentences).
Prompt 3 (Normalized Score = 3)
(1) number of source essays = 2
Being the early summer in souther California, the temperatures are going to be hot.
He had gotten directions from a group of old men who hadent gone the places they told him about in a long time.
Where they said there would be a town it was more like a ghost town.
The day that the cyclist was riding was very hot and he was riding through the desert, @CAPS made it worse.
Because of the heat he was drinking his water regularly to stay hydrated.
This meant he was running out of it fast since the old men had given him directions through places that were
no longer open he couldnt get more water.
The cyclist also had to go over roads that were in bad condition @CAPS slowed him down even more.
Those features at the setting affected the cyclist a lot on his trip.
(2) number of source essays = 3
Author Joe Kurmaskie learns the dangers of a lifeless wilderness during his experience described in do not exceed posted speed limit.
As a cyclist riding in one of the most undeveloped areas in California, the lack of water found in this setting proves crippling to his survival.
The trails he was directed to had no towns or sources of fresh, drinkable water for days.
Another quote that contributed to the cyclists journey was, that smoky blues tune Summer time rattled around in the dry honeycombs of my deteriorating brain.
This expressed how dehydrated the summer heat way making him.
This meant he was running out of it fast since the old men had given him directions through places that were no longer open he couldnt get more water.
The cyclist also had to go over roads that were in bad condition @CAPS slowed him down even more.
Those features at the setting affected the cyclist a lot on his trip.
(3) number of source essays = 3
The features of the setting greatly effected the cyclist.
He was riding along on a route he had little confidence would end up anywhere.
That being the first time ever being on that rode and having only not of date knowledge about it made the cyclist rework.
The temperature was very hot, where was little shade, the sun was beating down on him.
Next, he came to a shed desperate for water only to find rusty pumps with tainted, unsafe water.
Wide rings of dried sweat circled my shirt, and the growing realization that I could drop from heat stroke
This states that he @CAPS the danger of not having enough water in his system.
Features such as water and heat affected him/her throughout the story.
Not having enough water could make him/her lose more sweat and the heat is making him lose even more sweat which can cause extreme heatstroke.
(4) number of source essays = 6
The features of the setting affect the authors dispotion as well as his ability to complete the journey, thus creating an obstacle the author must overcome.
At the end of the paragraph five the author writes that I was traveling through the high deserts of California.
This setting is important because it adds a sense of urgency to his trip when he starts to run low on water.
The terrain changing into short rolling hills didnt help the cyclist.
For example, when the cyclist headed from the intense heat, he needed to find a town.
Its June in California and the cyclist fears he @MONTH soon suffer from heat stroke if he doesnt find water yet.
He said, I was going to die and the birds would pick me clean.
Normally, this would not cause too much struggle, but since he was dehydrated and overheated, each hill seemed crippling,
My thoughts are that if the setting had been a bit cooler and perhaps on the time period the old men had lived in, the cyclist would have had a much more enjoyable experience.

References

1. Hussein, M.A.; Hassan, H.; Nassef, M. Automated language essay scoring systems: A literature review. PeerJ Comput. Sci. 2019, 5, e208.
2. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the Advances in Neural Information Processing Systems 27, Montréal, QC, Canada, 8–13 December 2014; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014; pp. 2672–2680.
3. Odena, A.; Olah, C.; Shlens, J. Conditional Image Synthesis with Auxiliary Classifier GANs. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Precup, D., Teh, Y.W., Eds.; Proceedings of Machine Learning Research; PMLR: Sydney, Australia, 2017; Volume 70, pp. 2642–2651.
4. Wang, K.; Wan, X. SentiGAN: Generating Sentimental Texts via Mixture Adversarial Networks. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, 13–19 July 2018; pp. 4446–4452.
5. Li, Y.; Pan, Q.; Wang, S.; Yang, T.; Cambria, E. A generative model for category text generation. Inf. Sci. 2018, 450, 301–315.
6. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (Long and Short Papers); Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186.
7. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32.
8. Rodriguez, P.U.; Jafari, A.; Ormerod, C.M. Language models and Automated Essay Scoring. arXiv 2019, arXiv:1909.09482.
9. Yang, R.; Cao, J.; Wen, Z.; Wu, Y.; He, X. Enhancing Automated Essay Scoring Performance via Fine-tuning Pre-trained Language Models with Combination of Regression and Ranking. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 1560–1569.
10. Ormerod, C.M.; Malhotra, A.; Jafari, A. Automated essay scoring using efficient transformer-based language models. arXiv 2021, arXiv:2102.13136.
11. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In Proceedings of the 8th International Conference on Learning Representations (ICLR), Online, 27–30 April 2020.
12. Kitaev, N.; Kaiser, L.; Levskaya, A. Reformer: The Efficient Transformer. arXiv 2020, arXiv:2001.04451.
13. Clark, K.; Luong, M.T.; Le, Q.V.; Manning, C.D. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. arXiv 2020, arXiv:2003.10555.
14. Sun, Z.; Yu, H.; Song, X.; Liu, R.; Yang, Y.; Zhou, D. MobileBERT: A Compact Task-Agnostic BERT for Resource-Limited Devices. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 2158–2170.
15. Sethi, A.; Singh, K. Natural Language Processing based Automated Essay Scoring with Parameter-Efficient Transformer Approach. In Proceedings of the 2022 6th International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, 29–31 March 2022; pp. 749–756.
16. Zügner, D.; Kirschstein, T.; Catasta, M.; Leskovec, J.; Günnemann, S. Language-Agnostic Representation Learning of Source Code from Structure and Context. arXiv 2021, arXiv:2103.11318.
17. Chaslot, G.; Bakkes, S.; Szita, I.; Spronck, P. Monte-Carlo tree search: A new framework for game AI. In Proceedings of the Fourth Artificial Intelligence and Interactive Digital Entertainment Conference (AIIDE), Stanford, CA, USA, 22–24 October 2008; pp. 216–217.
18. Williams, R.J. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Mach. Learn. 1992, 8, 229–256.
19. Lapata, M.; Barzilay, R. Automatic Evaluation of Text Coherence: Models and Representations. In Proceedings of the IJCAI-05, Nineteenth International Joint Conference on Artificial Intelligence, Edinburgh, UK, 30 July–5 August 2005; pp. 1085–1090.
20. Chen, H.; He, B. Automated Essay Scoring by Maximizing Human-Machine Agreement. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1741–1752.
21. Dong, F.; Zhang, Y.; Yang, J. Attention-based Recurrent Convolutional Neural Network for Automatic Essay Scoring. In Proceedings of the SIGNLL Conference on Computational Natural Language Learning (CoNLL), Vancouver, BC, Canada, 3–4 August 2017; pp. 153–162.
22. Tay, Y.; Phan, M.C.; Tuan, L.A.; Hui, S.C. SkipFlow: Incorporating Neural Coherence Features for End-to-End Automatic Text Scoring. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 5948–5955.
23. Grzymala-Busse, J.W.; Hippe, Z.S.; Mroczek, T. Reduced Data Sets and Entropy-Based Discretization. Entropy 2019, 21, 51.
Figure 1. The overall architecture of EssayGAN.
Figure 2. The architecture of the discriminator used in EssayGAN.
Figure 3. The architecture of the i-th generator in EssayGAN.
Figure 4. An example of an augmented essay, showing the number of source essays and the number of sentences in reversed order.
Table 1. The hyperparameters used for training EssayGAN.

Hyperparameter | Generator | Discriminator
embedding dimension | 768 (input), 1536 (hidden) | 768 (input), 1536 (hidden)
pre-training epochs (steps) | 150 | 3 (5)
batch size | 256 | 256
learning rate | 0.01 (pre-training), 1 × 10^-4 (adversarial) | 1 × 10^-4
Monte Carlo search rollout K | 16 | NA
dropout | NA | 0.2
Table 2. Characteristics of the ASAP dataset. In the column “Type of essay”, ARG denotes argumentative essays, RES denotes source-dependent response essays, and NAR denotes narrative essays.

Prompt | Number of Essays | Average Length | Average Sentences per Essay | Score Range (Normalized) | Type of Essay | Number of Unique Sentences
1 | 1783 | 350 | 22.76 | 2–12 (0–4) | ARG | 40,383
2 | 1800 | 350 | 20.33 | 1–6 (0–3) | ARG | 36,359
3 | 1726 | 150 | 6.27 | 0–3 (0–3) | RES | 10,551
4 | 1772 | 150 | 4.63 | 0–3 (0–3) | RES | 8134
5 | 1805 | 150 | 6.60 | 0–4 (0–4) | RES | 11,614
6 | 1800 | 150 | 7.78 | 0–4 (0–4) | RES | 13,457
7 | 1569 | 250 | 11.64 | 0–30 (0–4) | NAR | 17,927
8 | 723 | 650 | 34.76 | 0–60 (0–4) | NAR | 24,943
Table 3. Comparison of the number of source essays according to the augmentation techniques used.

Prompt | Random | RandomOrder | EssayGAN w/o t-Step | EssayGAN | Average Sentences per Essay
1 | 18.05 | 17.85 | 7.18 | 4.2 | 22.76
2 | 17.96 | 17.72 | 8.24 | 4.8 | 20.33
3 | 5.38 | 5.33 | 2.45 | 2.26 | 6.27
4 | 4.76 | 4.72 | 2.37 | 2.24 | 4.63
5 | 5.64 | 5.58 | 2.41 | 2.23 | 6.60
6 | 6.17 | 6.10 | 2.78 | 2.78 | 7.78
7 | 10.05 | 9.88 | 3.22 | 2.99 | 11.64
8 | 22.42 | 21.96 | 10.37 | 6.82 | 34.76
AVG | 11.30 | 11.14 | 4.88 | 3.54 | 14.35
Table 4. Comparison of the number of sentences in reversed order according to the augmentation technique.

Prompt | Random | RandomOrder | EssayGAN w/o t-Step | EssayGAN
1 | 9.02 | 0.00 | 1.41 | 0.49
2 | 9.64 | 0.00 | 2.01 | 0.75
3 | 2.08 | 0.00 | 0.07 | 0.05
4 | 1.77 | 0.00 | 0.04 | 0.04
5 | 2.26 | 0.00 | 0.04 | 0.03
6 | 2.47 | 0.00 | 0.08 | 0.08
7 | 4.55 | 0.00 | 0.30 | 0.16
8 | 14.02 | 0.00 | 4.14 | 2.51
AVG | 5.73 | 0.00 | 1.01 | 0.51
Table 5. Comparison of local and global coherence between sentences in essays according to the augmentation techniques. ‘Training Data + 1 swapped’ refers to a real essay in which one sentence was replaced with a random sentence. ‘Training Data + 2 swapped’ refers to one with two swapped sentences. The numbers inside parentheses are values of global coherence. Boldface in the original indicates the highest values of coherence, excluding the ‘Training Data’. Training Data is the baseline; Random, RandomOrder, EssayGAN w/o t-Step, and EssayGAN are data augmentation (×1); the swapped variants are for comparison.

Prompt | Training Data | Random | RandomOrder | EssayGAN w/o t-Step | EssayGAN | Training Data + 1 Swapped | Training Data + 2 Swapped
1 | 0.312 (0.278) | 0.221 (0.220) | 0.224 (0.221) | 0.292 (0.254) | 0.293 (0.255) | 0.300 (0.270) | 0.289 (0.262)
2 | 0.320 (0.287) | 0.215 (0.214) | 0.220 (0.219) | 0.294 (0.255) | 0.297 (0.259) | 0.303 (0.273) | 0.290 (0.264)
3 | 0.252 (0.244) | 0.190 (0.190) | 0.189 (0.185) | 0.233 (0.214) | 0.232 (0.217) | 0.227 (0.221) | 0.211 (0.207)
4 | 0.300 (0.297) | 0.226 (0.224) | 0.220 (0.221) | 0.266 (0.256) | 0.258 (0.250) | 0.254 (0.254) | 0.235 (0.236)
5 | 0.334 (0.321) | 0.250 (0.249) | 0.254 (0.246) | 0.307 (0.286) | 0.307 (0.285) | 0.302 (0.292) | 0.281 (0.275)
6 | 0.292 (0.288) | 0.256 (0.256) | 0.258 (0.255) | 0.282 (0.271) | 0.278 (0.268) | 0.278 (0.276) | 0.269 (0.268)
7 | 0.331 (0.303) | 0.208 (0.209) | 0.211 (0.210) | 0.298 (0.260) | 0.296 (0.258) | 0.291 (0.270) | 0.268 (0.250)
8 | 0.332 (0.284) | 0.214 (0.214) | 0.219 (0.215) | 0.293 (0.246) | 0.293 (0.246) | 0.321 (0.276) | 0.310 (0.269)
AVG | 0.309 (0.288) | 0.222 (0.222) | 0.224 (0.221) | 0.283 (0.255) | 0.282 (0.255) | 0.284 (0.266) | 0.269 (0.254)
Table 6. QWK correlation comparison between different augmentation techniques. Training Data is the baseline; the remaining columns are data augmentation (×1).

Prompt | Training Data | Random | RandomOrder | EssayGAN w/o t-Step | EssayGAN | CS-GAN
1 | 0.7894 | 0.7601 | 0.7663 | 0.7656 | 0.818 | -
2 | 0.6741 | 0.6587 | 0.6507 | 0.6918 | 0.696 | -
3 | 0.6796 | 0.6525 | 0.6462 | 0.6776 | 0.686 | 0.665
4 | 0.8062 | 0.7695 | 0.7773 | 0.8128 | 0.828 | 0.788
5 | 0.8038 | 0.7614 | 0.7723 | 0.8054 | 0.824 | 0.790
6 | 0.8090 | 0.7767 | 0.7794 | 0.8166 | 0.829 | 0.799
7 | 0.8271 | 0.8056 | 0.8102 | 0.8286 | 0.865 | -
8 | 0.6454 | 0.6989 | 0.7110 | 0.7195 | 0.761 | -
AVG | 0.7543 | 0.7354 | 0.7392 | 0.7647 | 0.788 | -
Table 7. Performance comparison with the state-of-the-art AES systems from the literature.

Prompt | BERT-CLS [8] | XLNet-CLS [8] | R²BERT [9] | BERT-Ensemble [10] | BERT-Adapter [15] | BERT-CLS + Augmented Data of EssayGAN (Ours)
1 | 0.792 | 0.776 | 0.817 | 0.831 | 0.743 | 0.818
2 | 0.679 | 0.680 | 0.719 | 0.679 | 0.674 | 0.696
3 | 0.715 | 0.692 | 0.698 | 0.690 | 0.718 | 0.686
4 | 0.800 | 0.806 | 0.845 | 0.825 | 0.884 | 0.828
5 | 0.805 | 0.783 | 0.841 | 0.817 | 0.834 | 0.824
6 | 0.805 | 0.793 | 0.847 | 0.822 | 0.842 | 0.829
7 | 0.785 | 0.786 | 0.839 | 0.841 | 0.819 | 0.865
8 | 0.595 | 0.628 | 0.744 | 0.748 | 0.744 | 0.761
AVG | 0.748 | 0.743 | 0.794 | 0.782 | 0.785 | 0.788