
1 Introduction

The use of social networks has been constantly increasing in recent years, and social platforms such as Twitter, Facebook and LinkedIn, among others, have become a potential source of information, since users write and share opinions, personal experiences and messages among groups of people. The platform selected for this research is Twitter, where users write messages with a 140-character limit; these messages are called “tweets”, and on average 58 million tweets are posted daily [1].

Because of this, sentiment analysis can be a powerful tool for companies to assess their brand, as well as the various campaigns they may conduct, and thus support decisions regarding them. For this purpose, automated text classification may be used, which is the process of labeling texts with predefined thematic categories, given certain characteristics. Automated text classification has therefore been a subject of research that proposes, as its name implies, the automation of the tasks involved in manual text classification. It has also been used in several applications, such as filtering emails and organizing information.

The traditional approach to automating this process focuses primarily on the use of supervised learning algorithms for text classification [13]. Supervised learning in the context of text classification uses pre-labeled data as input for an algorithm that then learns the patterns of the training set by probabilistic means (Naive Bayes) or decision rules (Decision Trees), among other mechanisms.

One of the less studied characteristics of sentiment analysis is the potential change of opinion concerning the topic that users are discussing. With a traditional approach to automated text classification this presents a problem: once the algorithm is trained, it has only learned from a particular time frame. For this reason we propose a novel method, by means of artificial intelligence techniques proposed by [12], where the implementation of a framework based on intelligent agents provides the classifier with a method and a criterion for retraining, as required. The use and implementation of this framework is justified by the lack of studies on the use of intelligent agents for tweet classification, more specifically, sentiment analysis. It is therefore expected that this research will supply significant knowledge for both areas.

This study is structured as follows: the next section presents the State of the Art. Section 3 presents a more detailed description of the problem and the dataset used in this study. Section 4 then describes the proposed methodology, detailing the agents that compose the multi-agent system. The experimental results are shown in Sect. 5, and in the last section a brief discussion of the results is presented with some insight into possible future work.

2 State of the Art

The baseline of this paper is inspired by the survey in [13], where the text classification problem is discussed thoroughly. That work covers the subset of problems that derive from text classification, diverse methods for preprocessing texts, supervised machine learning algorithms for automating the process, and the performance metrics used for evaluating these algorithms. In terms of intelligent agents, Russell and Norvig [12] provided a solid guideline defining the baseline for different types of agents. For this paper, we used the learning agent as an inspiration for implementing the intelligent agent framework.

In the context of the Twitter platform, the use of intelligent agents is more often linked to information retrieval [5], simulation of user behavior when facing rumors [14] and online influence simulation [10]. As for the use of intelligent agents for text classification, only studies outside the scope of sentiment analysis were found, in particular on how to implement collaborative learning systems [11] and how to train agents in game-oriented environments [2].

Regarding the current state of studies on text classification and sentiment analysis on Twitter using machine learning techniques, there is a larger body of literature. There are studies oriented to testing several supervised machine learning algorithms [9], the application of a distant-supervision machine learning approach [7], and the use of several text representation techniques applied to tweets [3].

3 Problem and Data Used

3.1 Problem Definition

In the subject of sentiment analysis on Twitter, with the massification of social networks, companies can use the platform for directing marketing campaigns for certain products, or for analyzing clients' opinions about the brand, in order to make beneficial decisions. In conjunction with this, a less explored aspect of sentiment analysis is a possible change in the sentiment/opinions of users regarding the topic under analysis. This is an interesting point, because in a traditional workflow with machine learning algorithms, a model trained in this way will perform poorly in detecting those changes of opinion. The main cause is that the model is trained only once, traditionally with a dataset large enough to expose the classifier to as many cases as possible.

For this very reason, it is crucial for this study to create a mechanism that can detect important differences between sets of tweets, in order to apply re-training techniques to the classifier. This mechanism can be useful not only for detecting changes in opinions; it can also be used to detect and learn from sets containing words that never appeared in the training set.

3.2 Data

The dataset used in this project is the original dataset used by [3], which corresponds to a collection of tweets written by people about two Chilean retail companies. The dataset was obtained between 2013 and 2014, the tweets were labeled manually by a group of five people, and the data was granted by the company Analitic S.A. for the development of this study.

The main difference from the first use of this dataset is that in this study 3000 tweets were used in total, 1000 for each label available in the dataset: positive, negative and neutral. These tweets were distributed evenly between the two retail companies for each possible label.

It is worth noting that preprocessing techniques were applied to the texts before the training/testing process of the classification algorithm. The preprocessing consisted of the elimination of stopwords, symbols, numbers, hyperlinks and the term “RT”. The latter was removed because it is a term conventionally used on Twitter for replicating a message posted by another user, and a possible way to add commentary given the 140-character limit.
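A minimal sketch of these preprocessing steps in Julia is given below; the stopword list shown is a hypothetical subset, since the actual Spanish stopword list used in the study is not given:

```julia
# Illustrative preprocessing sketch; STOPWORDS is a hypothetical subset,
# not the list actually used in the study.
const STOPWORDS = Set(["de", "la", "que", "el", "en", "y"])

function preprocess(tweet::AbstractString)
    text = lowercase(tweet)
    text = replace(text, r"https?://\S+" => " ")     # remove hyperlinks
    text = replace(text, r"[^a-záéíóúñü\s]" => " ")  # remove symbols and numbers
    tokens = split(text)
    return filter(t -> t != "rt" && !(t in STOPWORDS), tokens)
end

preprocess("RT Me encanta la tienda http://t.co/x 123")
# -> ["me", "encanta", "tienda"]
```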

In this way the final dataset is built; it will be used as a document pool for interaction with the distributor agent. This agent, and the whole intelligent agent framework, are described in the following section.

Fig. 1. Architecture of the multi-agent system, showing the three main agents and how the communication flow is directed.

Fig. 2. Activity diagram depicting the setup phase for the three agents.

4 Methodology

4.1 Multi-agent Architecture

To address the problem described in the previous section, a multi-agent system composed of three agents was chosen: the distributor agent, the critic agent and the classifier agent. In the literature related to these systems [16], various methods are proposed for how the agents interact with the environment and with each other. Here, an approach based on a hybrid layer placement of the agents (combining both vertical and horizontal stacking) has been adopted. This means that not all the agents interact with the environment; in particular, the critic agent only interacts with the distributor and classifier agents.

Fig. 3. Activity diagram depicting the iterations phase for the three agents.

The positioning of the agents in the hybrid layering is as follows: in the first layer sits the distributor agent, which interacts with the environment (the document pool); this agent collects a number of tweets from the pool and then sends this set to both the classifier and the critic agent.

The critic agent is equipped with four elements: the knowledge of the critic/classifier (the set of tweets with which the classifier agent was trained), two mechanisms for detecting differences between a set of tweets and the knowledge (a Student's t-test analysis and a histogram comparison), and a dimensionality reduction mechanism for the knowledge analysis (term ranking).

Finally, the classifier agent implements a supervised learning algorithm as the core mechanism of the classifier. The interaction between the classifier and the critic agent corresponds to communication messages concerning certain behaviors of the agents, for example, coordination messages regarding their shared knowledge. It should be noted that the output of the classifier agent is the classification of the set of tweets sent by the distributor agent.

Figure 1 presents a diagram depicting the multi-agent system architecture.

Interaction Between Agents. Figures 2 and 3 display activity diagrams of the interaction between the agents. Figure 2 represents the first iteration, when the first set of documents arrives at both agents and there is no previous knowledge to compare against. Figure 3 shows the process of the multi-agent system through all iterations after the setup phase.

In the next subsections, the functionality of each agent will be further elaborated.

4.2 Distributor Agent

Main Functionality. The distributor agent has the purpose of reading n tweets out of a document pool (the environment) and distributing them to both the critic and classifier agents. In the test implementations, the process of obtaining the n tweets from the document pool is the feature that adds the temporal aspect to the tests, simulating batches of tweets that the agent picks for each iteration. In a real-world application, however, this agent would have to pick the tweets in real time and create batches of size n for later distribution.

For the test implementation, the distributor agent also knows the true classes of the tweets beforehand. This allows the re-training process of the classification agent and the evaluation of its performance; in a commercial application, this agent could use an approach based on [7] for the automatic labeling of a live Twitter feed. If that technique is not applied, the distributor agent should instead store the set of tweets whenever a re-training message is triggered, for later labeling by a human expert and subsequent re-training.
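As a rough sketch of the distribution behavior, assuming the TCP messaging described in Sect. 4.5 is wrapped in a `send` function (all names here are illustrative, not the authors' implementation):

```julia
# Distributor sketch: walk the document pool in batches of n and send each
# batch to both the classifier and the critic agent.
function distribute(pool::Vector, n::Int, send::Function)
    for batch in Iterators.partition(pool, n)
        send(:classifier, batch)  # batch to classify
        send(:critic, batch)      # same batch for the difference analysis
    end
end
```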

4.3 Classifier Agent

Main Functionality. The purpose of the classifier agent, as its name suggests, is to classify the tweets sent by the distributor agent. The algorithm chosen for the classification core of the agent is the Naïve Bayes algorithm [12, 13], specifically the multinomial Naïve Bayes, represented by the following equation:

$$\begin{aligned} P(c_j|d) = \ln {P(c_j)} + \sum _{i=1}^{M}\ln {P(t_i|c_j)} \end{aligned}$$
(1)

where \(c_j\) corresponds to a class j, d to any document (or tweet, for the scope of this study) and \(t_i\) to a term i present in the document d.

The multinomial Naïve Bayes was used because it is one of the models that best fits the feature vector used in the bag-of-words representation [8]. In addition, the log formulation of Eq. (1) avoids the underflow errors that arise when multiplying the low-value probabilities caused by low term frequencies in the training corpus.
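A minimal sketch of this classifier in Julia is given below; the Laplace smoothing is an assumption, since the paper does not state the smoothing scheme used:

```julia
# Sketch of multinomial Naive Bayes with log-probabilities (Eq. 1) and
# Laplace smoothing; the smoothing choice is an assumption, not the paper's.
struct NBModel
    classes::Vector{Symbol}
    priors::Dict{Symbol,Float64}                  # ln P(c_j)
    condprob::Dict{Symbol,Dict{String,Float64}}   # ln P(t_i | c_j)
    unseen::Dict{Symbol,Float64}                  # ln-probability for unseen terms
end

function train_nb(docs::Vector{Vector{String}}, labels::Vector{Symbol})
    classes = unique(labels)
    vocab = Set(t for d in docs for t in d)
    priors = Dict{Symbol,Float64}()
    condprob = Dict{Symbol,Dict{String,Float64}}()
    unseen = Dict{Symbol,Float64}()
    for c in classes
        cdocs = [d for (d, l) in zip(docs, labels) if l == c]
        priors[c] = log(length(cdocs) / length(docs))
        counts = Dict{String,Int}()
        for d in cdocs, t in d
            counts[t] = get(counts, t, 0) + 1
        end
        total = sum(values(counts); init = 0) + length(vocab)  # Laplace denominator
        condprob[c] = Dict(t => log((get(counts, t, 0) + 1) / total) for t in vocab)
        unseen[c] = log(1 / total)
    end
    return NBModel(classes, priors, condprob, unseen)
end

# Pick the class maximizing ln P(c) + sum_i ln P(t_i | c).
classify(m::NBModel, doc::Vector{String}) =
    argmax(c -> m.priors[c] + sum(get(m.condprob[c], t, m.unseen[c]) for t in doc; init = 0.0),
           m.classes)
```

Here `train_nb` would be called on the bag-of-words knowledge, and `classify` on each incoming tweet.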

Re-training of the Classifier Model. This agent includes a re-training mechanism, triggered by a message from the critic agent. The mechanism consists of adding a subset of the new feature vectors to the trained model; this subset is generated by randomly splitting off a percentage of the set of tweets sent by the distributor agent. For re-training purposes, this agent asks the distributor agent for the true classes.
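A sketch of the random split, where `p` is the split value for re-training listed in Sect. 5.1 (the helper name is hypothetical):

```julia
using Random

# Take a random fraction p of the incoming batch for re-training.
function retrain_subset(batch::Vector, p::Float64)
    k = round(Int, p * length(batch))
    return batch[randperm(length(batch))[1:k]]
end
```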

Fig. 4. A matrix representation of two sets of documents \(A_t\) and \(A_{t+1}\), with \(d_i\) representing each document of a set and \(t_j\) representing a term j of a document.

4.4 Critic Agent

Main Functionality. The main functionality of the critic agent lies in its knowledge. By applying different statistical measures to the knowledge, in a similar manner to [15], relevant information is obtained. These measures are: the Student's t-test, a histogram comparison test and term ranking.

The main purpose of applying these analyses to the classifier agent's knowledge is to see whether there are differences with another tweet corpus. Formally, given a document corpus D and two subsets \(A_t\) and \(A_{t+1}\), both belonging to D, the hypothesis \((A_t - A_{t+1} \sim N(0,1))\) must be tested. An example of a representation of D is presented in Fig. 4.

Another important point for the development of the critic agent in this study is the existence of two assumptions. The first is that the subset \(A_t\) is independent and identically distributed (i.i.d.); this implies that the documents belonging to the subset \(A_t\) are i.i.d., and that the words of those documents are also i.i.d. The second is that both subsets \(A_t\) and \(A_{t+1}\) are temporal data fluxes (in time steps t), which enables the statistical analysis of the subsets in such a way that, if the subset \(A_t\) is normally distributed, then there should be no significant difference with future temporal subsets. If the opposite happens and there is a significant difference given a certain threshold, then the critic agent sends a message to the classifier agent, indicating that it has to be retrained on \(A_{t+1}\). As this retraining process is repeated with future subsets \((A_{t+2}, A_{t+3}, \ldots, A_{t+n})\), the difference is expected to shrink with each re-training iteration, decreasing the chances of finding significant differences with future subsets and therefore reducing the number of retraining instances.

Knowledge Limit. Another mechanism added to the critic agent is a restriction on the number of tweets (or documents) handled in the bag-of-words representation (the knowledge). The main reason for this mechanism has to do with the scope of the sentiment analysis data used in this study: forgetting old tweets can help the classifier learn possible changes in opinions and trends at each iteration.

To enforce the knowledge limit, at each iteration in which the classifier agent receives a re-training message from the critic agent, a check is performed to determine whether the number of trained tweets plus the number of new tweets to re-train on exceeds a certain limit. If it does not exceed the limit, the new tweets are added to the knowledge by updating the bag-of-words representation. Otherwise, a variable num is calculated, indicating how many documents need to be removed to avoid surpassing the limit; the first num tweets are then removed from the knowledge and the new tweets are added to the bag-of-words representation. Finally, a new instance of the Naive Bayes algorithm is trained on the new bag-of-words representation.
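In code, this update could look as follows (a sketch under the assumption that the knowledge is kept as an ordered list of tweets, oldest first):

```julia
# Knowledge-limit update: drop the oldest `num` tweets if needed, then append
# the new ones; a fresh Naive Bayes model would then be trained on the result.
function update_knowledge!(knowledge::Vector, new_tweets::Vector, limit::Int)
    num = length(knowledge) + length(new_tweets) - limit
    num > 0 && deleteat!(knowledge, 1:num)
    append!(knowledge, new_tweets)
    return knowledge   # caller retrains the classifier on this
end
```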

Next, the statistical methods used by the critic agent are explained.

T-Student Test. The Student's t-test belongs to the family of hypothesis-contrast tests: given a certain hypothesis, if the null hypothesis is not rejected, the test shows no statistical difference between the two distributions. For this, the mean values and the variances of the sets are used, along with two statistical hypotheses: on the one hand the null hypothesis, which assumes that the means of both distributions are equal; on the other, the alternative hypothesis, which indicates that they are different. This is represented in the equations

$$\begin{aligned} H_0 :\mu _a = \mu _b \end{aligned}$$
(2)
$$\begin{aligned} H_1 : \mu _a \ne \mu _b \end{aligned}$$
(3)

In order to reject the null hypothesis it is necessary to calculate the result of the t-test and contrast it with a critical value. If the test result is larger than the critical value, the null hypothesis is rejected and the means of both data sets are said to be different. If the result is smaller than the critical value, the null hypothesis is not rejected; however, it cannot be guaranteed that the means of both sets are equal.

The t statistic is calculated using one of two formulas: for samples with equal variances and different sample sizes (Eq. 4), and for samples with different variances and different sample sizes (Eq. 5).

$$\begin{aligned} t = \frac{\bar{X_1} - \bar{X_2}}{S_{x_1x_2}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \,, \quad S_{x_1x_2} = \sqrt{\frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1 + n_2 - 2}} \end{aligned}$$
(4)
$$\begin{aligned} t = \frac{\bar{X_1} - \bar{X_2}}{\sqrt{\frac{(S_1)^2}{n_1} + \frac{(S_2)^2}{n_2}}} \end{aligned}$$
(5)

Since it cannot be assumed that the variances of these samples are equal, an F-test is performed to check the equality of the two variances. The result of this test is calculated using the following equation

$$\begin{aligned} F= \frac{(S_1)^2}{(S_2)^2} \end{aligned}$$
(6)

The further this ratio deviates from 1, the stronger the evidence that the variances are different. As in the t-test, two hypotheses are contrasted: the null hypothesis that the variances of both sets are equal, and the alternative hypothesis that they are different. The value F is compared with its critical values, and if these are exceeded, the variances can be said to be different.
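A sketch of how the critic could combine Eqs. (4)-(6) follows; the F critical value is left as a parameter (and its default is purely illustrative), since base Julia provides no distribution quantile tables:

```julia
using Statistics

# Test selection sketch: an F-test (Eq. 6) on the sample variances chooses
# between the pooled statistic (Eq. 4) and the Welch statistic (Eq. 5).
function t_statistic(x::Vector{Float64}, y::Vector{Float64}; f_crit::Float64 = 2.0)
    n1, n2 = length(x), length(y)
    s1, s2 = var(x), var(y)                      # sample variances
    F = s1 / s2                                  # Eq. (6)
    if F > f_crit || inv(F) > f_crit             # variances deemed different
        return (mean(x) - mean(y)) / sqrt(s1/n1 + s2/n2)            # Eq. (5)
    else                                         # pooled-variance case
        sp = sqrt(((n1 - 1)*s1 + (n2 - 1)*s2) / (n1 + n2 - 2))
        return (mean(x) - mean(y)) / (sp * sqrt(1/n1 + 1/n2))       # Eq. (4)
    end
end
```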

Histogram Comparison. A histogram is a representation of the underlying frequency distribution of a set of continuous data. In the analysis of histogram differences, we take two sets of data A and B, with their corresponding histograms H(A) and H(B) and probability density functions P and Q respectively, and proceed to calculate the distance between the elements of the density functions. For this, the Sorensen distance [6] is used, which is calculated by:

$$\begin{aligned} d_{sor} = \frac{\sum _{i=1}^D |P_i - Q_i|}{\sum _{i=1}^D (P_i + Q_i)} \end{aligned}$$
(7)

When the value of this distance is close to 0, the distributions of the two data sets A and B are similar. Conversely, as the value moves farther from 0, the data sets are said to be different.
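Equation (7) transcribes directly into code; here `p` and `q` are assumed to be the binned density estimates of the two term histograms:

```julia
# Sorensen distance between two density histograms (Eq. 7).
sorensen(p::Vector{Float64}, q::Vector{Float64}) =
    sum(abs.(p .- q)) / sum(p .+ q)

sorensen([0.5, 0.3, 0.2], [0.4, 0.4, 0.2])  # 0.1, i.e. similar distributions
```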

Term Ranking. In a corpus, the number of distinct words grows rapidly with both the number of texts and their nature. The texts in a corpus can be represented as in Fig. 4, i.e., as a two-dimensional matrix \(D \times T\), where D corresponds to the documents and T to the terms that exist within the documents. In order to reduce the size of this matrix, it is possible to perform feature selection operations that filter out unwanted words, since they provide information that is not relevant to the model.

Once the dimensionality is reduced, these terms can be grouped by the number of total occurrences in the corpus in descending order, generating a histogram for the new matrix. The ranking of terms in this project was carried out in two ways: ranking the n terms from highest to lowest according to their frequencies, and applying the \(tf-idf\) representation, which calculates the frequency of a term and multiplies it by the inverse document frequency of that term. Once these weights are applied to each word of the \(D \times T\) matrix, the words with lower \(tf-idf\) values are eliminated, and the n terms are then ranked according to their \(tf-idf\) values in this new matrix.
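A sketch of the tf-idf ranking over tokenized documents; the unsmoothed idf form ln(D/df) is an assumption, as the paper does not specify the exact variant used:

```julia
# Term ranking by tf-idf: weight each term's total frequency by the inverse
# document frequency, then keep the top n terms.
function rank_terms(docs::Vector{Vector{String}}, n::Int)
    tf = Dict{String,Int}()   # total occurrences in the corpus
    df = Dict{String,Int}()   # number of documents containing the term
    for d in docs
        for t in d
            tf[t] = get(tf, t, 0) + 1
        end
        for t in Set(d)
            df[t] = get(df, t, 0) + 1
        end
    end
    D = length(docs)
    scored = [(t, f * log(D / df[t])) for (t, f) in tf]
    sort!(scored, by = last, rev = true)
    return first(scored, min(n, length(scored)))
end
```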

4.5 Implementation

The project was implemented in Julia [4], which meets the requirements of being a high-level language oriented to dynamic, high-performance programming, and of being relatively simple for prototyping. Although Julia offers a wide set of tools for the mathematical and statistical computations used mainly by the critic agent, at the time of writing there was no framework for the simple creation and implementation of software agents and multi-agent systems. For this reason a basic implementation was built on the network protocols provided by Julia: each agent is a server, and communication is done by sending messages over TCP/IP.
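The scheme could be sketched with Julia's Sockets standard library as follows (ports and the line-based message format are illustrative assumptions, not the authors' protocol):

```julia
using Sockets

# Each agent is a small TCP server that answers newline-delimited messages.
function run_agent(port::Int, handle::Function)
    server = listen(port)
    while true
        sock = accept(server)
        @async while isopen(sock)
            msg = readline(sock)
            isempty(msg) && break
            println(sock, handle(msg))  # reply to the peer agent
        end
    end
end

# e.g. a critic replying "RETRAIN" when its (hypothetical) difference test fires:
# run_agent(9002, msg -> needs_retraining(msg) ? "RETRAIN" : "OK")
```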

5 Experimentation

5.1 Test Configurations

First, two control tests were carried out: a classifier that was trained only on the first instance, and a classifier that was re-trained on every instance in which tweets arrive. These control tests serve as baselines for the results obtained in the following experiments. A total of ten runs were carried out for each test configuration, and the results were averaged to obtain the performance measures.

T-Student Approach. The tests of the critic agent implementing the Student's t-test as a measure of comparison between texts entailed different configurations, in order to observe their effect on the performance of the classifier. The iterations involved implementing the agents with the following possible configurations:

  • Knowledge limit: [500, 1000, 1500, 2000]

  • Number of documents sent per iteration: [100, 150, 300]

  • Term ranking: [0, 100, 500, 1000]

  • Split value for re-training: \([50\%, 70\%, 90\%]\)

  • Critic threshold value: [0.15, 0.2, 0.25, 0.3]

  • Tf-idf use: [True, False]

Combining all the configurations described above yields 576 tests, each of which was repeated 10 times in order to obtain the average for each configuration. After obtaining the averages, we proceeded to eliminate all the results which never re-trained, and those that re-trained in every iteration, because they present behaviour similar to that of control tests 1 and 2 respectively. This filtering process left a total of 147 configurations that meet the criteria described above. That is, in \(74.48\%\) of configuration instances one of the two filtering criteria was fulfilled (re-training in all instances or never re-training), leaving the rest for the result analysis.

Histogram Comparison Approach. For the tests of the critic agent implementing the histogram difference as a measure for text comparison, configurations similar to those of the t-test approach were used, and others with different parameters were also carried out to see their effects. The tests were:

  • Knowledge limit: [1000, 2000]

  • Number of documents sent per iteration: [150, 300]

  • Term ranking: [0]

  • Split value for re-training: \([50\%, 70\%, 90\%]\)

  • Critic threshold value: [0.5, 0.1, 0.15]

  • Tf-idf use: [False]

Combining these configurations yields a total of 36 tests; these, like the tests of the t-test approach, were repeated 10 times, obtaining the average for each configuration. The filtering process was then performed to eliminate the re-trained instances which presented the same behaviour as control tests 1 and 2, which reduced the 36 configuration instances to 26, an elimination of \(27.78\%\) of configuration instances according to the filtering criteria.

5.2 Evaluation Metrics

Regarding the evaluation of the performance of text classification algorithms, different methods have been established. The relevant literature [13] proposes analyzing the accuracy and the \(F_1\)-score of the classifier; to carry out these analyses, the possible outcomes of the classification of a document must be determined. These can be represented in a confusion matrix.

Figure 5 presents the confusion matrix according to the possible classification outcomes of a document: true positive (TP) if a document was correctly classified in the category it belongs to; false positive (FP) if a document was incorrectly classified in a class where it does not belong; false negative (FN) if a document was not classified in a category where it should be; and true negative (TN) if a document was correctly not assigned to a category it does not belong to.

Fig. 5. Representation of the confusion matrix in a binary classification problem.

Accuracy: Accuracy (A) measures the proportion of instances the classifier predicted correctly over the whole sample. It is defined as:

$$\begin{aligned} A=\frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
(8)

\({{\varvec{F}}}_{\varvec{1}}\)-Score: The \(F_1\)-score is another way to measure performance, and considers the precision \((\pi )\) and recall \((\rho )\). The \(F_1\) value ranges over \(\left[ 0,1\right] \), with higher values indicating better classifier performance. It is defined as:

$$\begin{aligned} F_1 = \frac{2\pi \rho }{\pi + \rho } \end{aligned}$$
(9)

where \(\pi \) and \(\rho \) are defined respectively as:

$$\begin{aligned} \pi = \frac{TP}{TP + FP} \end{aligned}$$
(10)
$$\begin{aligned} \rho = \frac{TP}{TP + FN} \end{aligned}$$
(11)
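For reference, Eqs. (8)-(11) translate directly into code, taking the raw confusion-matrix counts:

```julia
# Accuracy and F1 from confusion-matrix counts (Eqs. 8-11).
accuracy(tp, tn, fp, fn) = (tp + tn) / (tp + tn + fp + fn)

function f1_score(tp, fp, fn)
    precision = tp / (tp + fp)  # Eq. (10)
    recall    = tp / (tp + fn)  # Eq. (11)
    return 2 * precision * recall / (precision + recall)  # Eq. (9)
end

f1_score(80, 10, 20)  # 0.842...
```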

6 Results

The 10 best results of the tests, given the various configurations described above, are reported. The configurations of these tests are listed in Table 1.

Table 1. Configurations for the 10 best test performances
Table 2. Performance metrics for both control and best tests.

As a summary, Table 2 details the best accuracy over all iterations and the average accuracy across iterations, together with the best \(F_1\)-score over all iterations and the average \(F_1\)-score. Test number 6 obtained the best accuracy and the best \(F_1\)-score over all iterations, while test number 8 obtained the best average accuracy and average \(F_1\)-score. The test with the least variation in \(F_1\)-score is test number 10.

Next, a summary of the re-training statistics for the top 10 tests is shown in Table 3. The best average re-training was obtained by tests 8 and 9 (the lower the value, the better).

Table 3. Retraining metrics for both control and best tests.
Table 4. Two-tailed Student's t-test results for both control and best tests.

6.1 Significance Test

In conjunction with the results shown above, two-tailed Student's t-tests were performed across all 12 tests (including the control tests) in order to see whether the improvements were statistically significant. Table 4 shows the results, displaying the numerical value for all the tests that show a statistically significant difference; where there is no significant difference, NSS (no statistical significance) is displayed. In the case of control tests 1 and 2, all results correspond to improvements when compared with all other tests. The other result exhibited by this table is that there is no significant difference when comparing the remaining tests with each other. This could indicate that, for an implementation of this method to retrain classifiers, one could choose the most cost-effective configuration in terms of \(F_1\)-score performance versus the re-training ratio of the classifiers.

7 Conclusions

In this work an intelligent agent framework for automatic tweet classification was proposed. The framework consists mainly of three types of agents: critic, classifier and distributor. For the critic agent, two approaches were implemented. The first is based on hypothesis testing (the Student's t-test), which determines whether there are statistical differences in the distribution of terms and decides whether it is necessary to retrain. The second approach analyzes the difference of histograms to make the same decision.

The experiments suggest that the implementation of the critic agent with all its mechanisms improved the performance of the classifier agent. In addition, it was shown that a classifier that retrains in every iteration is not the optimal approach to the problem, as can be seen in the results.

It should be noted that the best classifier in terms of accuracy and \(F_1\)-score across all tests was obtained with test configuration 6. Furthermore, when analyzing which of the selected configurations required the least amount of re-training, the best was test configuration 8.

As future work, the modification of the critic agent will be addressed, either by implementing other statistical tests for comparison or by improving the tests already implemented and the possible actions to be performed. Regarding the classifier agent, improvements are possible either by implementing different classifiers (support vector machines, decision trees, etc.) or by focusing on re-training, creating an ensemble of classifiers that cooperatively predicts the set of tweets.