NEURAL NETWORKS WITH EMOTION ASSOCIATIONS, TOPIC MODELING AND SUPERVISED TERM WEIGHTING FOR SENTIMENT ANALYSIS

Automated sentiment analysis is gaining increasing recognition due to the growing importance of social media and the review sections of e-commerce platforms. Deep neural networks outperform traditional lexicon-based and machine learning methods by effectively exploiting contextual word embeddings to generate dense document representations. However, this representation model is not fully adequate to capture topical semantics and the sentiment polarity of words. To overcome these problems, a novel sentiment analysis model is proposed that utilizes richer document representations based on word-emotion associations and topic models, which is the main computational novelty of this study. The sentiment analysis model integrates word embeddings with lexicon-based sentiment and emotion indicators, including negations and emoticons, and, to further improve its performance, a topic modeling component is utilized together with a bag-of-words model based on a supervised term weighting scheme. The effectiveness of the proposed model is evaluated using large datasets of Amazon product reviews and hotel reviews. Experimental results demonstrate that the proposed document representation is valid for the sentiment analysis of product and hotel reviews, irrespective of their class imbalance. The results also show that the proposed model improves on existing machine learning methods.


Introduction
Sentiment analysis is intended to reveal users' real opinions or attitudes toward different aspects of products and services. 1 For example, consumers tend to post their reviews on online shopping platforms, particularly when their experience was exceptionally good or bad. Product reviews also help businesses understand consumers' concerns and help other consumers make purchase decisions.
The last two decades have witnessed considerable developments in automated sentiment analysis, which has become a widely studied text categorization task. Its aim is to label text documents as having a positive or negative orientation. Sentiment orientation has a major impact on the perceived helpfulness of online comments. 2 The steadily increasing number of online comments across major shopping platforms and social media has led to the necessity of developing automated sentiment analysis systems. 1 On the one hand, various sentiment lexicons have been constructed to produce sentiment scores. On the other, numerous machine learning models have been proposed for the task, including ones with unsupervised, 3 semi-supervised 4 and supervised learning. 5 Three levels of granularity have been considered in the sentiment analysis of online comments, namely the document, sentence and aspect levels. At the document level, it is assumed that sentiment is consistent within an online comment, and comments are categorized into positive or negative sentiment classes. In other words, this classification task assumes that the online comments concern a single entity. Sentence-level sentiment analysis only selects and considers opinion sentences. Aspect-level sentiment analysis must first identify the comment's aspect (target), which in turn leads to two subtasks, namely aspect extraction and aspect sentiment classification.
Looking at the features used for sentiment analysis, the bag-of-words model represents a traditional document representation that calculates term frequencies for each word or phrase in the vocabulary. 6 However, this approach suffers from a high-dimensional sparse document representation. Moreover, even when n-grams rather than single words are used, only a limited context can be taken into account. To address these issues, scholars introduced word embeddings to generate low-dimensional dense word representations. [7][8][9][10][11] Word embeddings are also more effective than the bag-of-words approach in modelling word context and word meaning.
Deep neural networks (DNNs) have recently attracted particular interest and proved to be effective text and image classification tools due to their capacity to learn complex feature representations. [12][13][14][15][16][17][18] To avoid the above-mentioned high-dimensional sparse word representations, DNNs utilize word embeddings to model local word context, which in turn leads to a lower-dimensional dense word representation. Alternatively, DNNs can be used to produce such word representations and, by averaging all words in the document, provide inputs to other machine learning-based classification models, such as support vector machines (SVMs). 39 A major issue with traditional word embeddings is that they fail to consider the sentiment of the words in terms of both sentiment polarity and intensity. Moreover, different aspects of comments are often neglected. DNNs have emerged as a promising approach for aspect extraction and the sentiment analysis of online comments due to their ability to capture both semantic and syntactic high-level features without requiring prior feature engineering. 39 The aspect-specific word embedding model proposed by Du et al. 20 remains the only study investigating word vectors with respect to topics extracted using latent Dirichlet allocation (LDA). However, this approach also suffers from some serious drawbacks. First, as with all the previously mentioned methods, the sentiment polarity / intensity of words is overlooked. Second, such models consider the topics regardless of the capacity of words to discriminate between positive and negative comment orientation, particularly when considering slow LDA inference.
This study aims to overcome the above problems by developing a DNN model with a richer document representation, which integrates word-emotion associations, a topic modeling component and supervised term weighting. This document representation builds upon recent work combining word embeddings with sentiment scores. 5 However, several major novelties are presented in our model. First, compared with previous work, multiple lexicon-based sentiment and emotion indicators are used that provide the words contained in word embeddings with a more thorough assessment of sentiment polarity (positive, negative, or neutral), sentiment intensity (strength of positive and negative sentiments) and emotions (mood states), 21-23 including emoticons and negating words. Furthermore, this novel document representation is combined with a topic modeling component based on LDA. Finally, this is the first study to demonstrate the effect of supervised term weighting in a DNN model for sentiment analysis. Specifically, the bag-of-words representation is selected based on a supervised term-weighting scheme, thus considering the power of terms to discriminate between positive and negative sentiment orientation. Supervised learning is preferred in this study because a large number of labeled training documents can be obtained from existing datasets for sentiment analysis. In summary, the contributions of our study are twofold:
• A novel DNN-based sentiment analysis model is proposed that, as far as we know, is the first to integrate word-emotion associations with a topic modeling component and a computationally effective bag-of-words component.
• Two benchmark datasets of Amazon product reviews and hotel reviews are used to demonstrate the effectiveness of the proposed integrated document representation model in sentiment analysis, and significant improvements in classification performance over state-of-the-art sentiment analysis methods are reported.
This article is a significantly extended version of the conference paper, 24 which demonstrated the effectiveness of word-emotion association for sentiment analysis. Here, an improved sentiment analysis model is proposed that is equipped with the topic modeling component and supervised term weighting. This allows us to examine the effects of different document representations on sentiment classification performance. In addition, an in-depth comparative statistical analysis is performed against existing sentiment analysis methods on the Amazon product review and hotel review datasets.
The remainder of this paper is structured as follows. Section 2 reviews recent advances in the automated sentiment analysis of online comments. Section 3 introduces the details of the proposed sentiment analysis model. Section 4 presents the datasets used for model evaluation. Section 5 presents experimental results and compares the model performance with existing models. Section 6 concludes, highlighting further research directions.

Related Work
Over the last two decades, there has been a considerable amount of literature on the automated sentiment analysis of online comments. Notably, recent years have seen considerable interest in DNN-based approaches. This section reviews previous machine learning-based approaches to the sentiment analysis of online comments, as presented in the list of related studies in Table 1.

Bag-of-words Models
As shown in earlier studies, neural networks (NNs) outperform other traditional machine learning methods such as SVM and Naïve Bayes (NB) in this classification task, regardless of whether the datasets are balanced or unbalanced. 25 The traditional approach uses the bag-of-words model to generate a sparse and high-dimensional document representation. 26,27 However, shallow NNs have a limited ability to deal with sparse datasets. 28 By contrast, DNNs can capture more complex features from documents. Glorot et al. 29 proposed a DNN approach employing unsupervised learning to demonstrate that effective word representation is possible by learning a stacked denoising autoencoder. They also showed that such a representation can be easily adapted to different product and service domains. To overcome the scalability problems of traditional autoencoders with the high-dimensional bag-of-words model, Zhai and Zhang 30 proposed a semi-supervised autoencoder. Specifically, they introduced supervision into the model via a loss function obtained from a linear classifier. Initially, convolutional NNs (CNNs) also used the bag-of-words representation, 6 which was the first attempt to make use of word order for sentiment analysis.

Word Embedding Models
To further improve the performance of DNNs in sentiment analysis, other studies employed vector representation models, such as Word2Vec (including the continuous bag of words (CBOW) and Skip-Gram models), 31,32 bidirectional encoder representations from transformers (BERT) 33 and GloVe. 34 The decisive advantage of these models is that they produce dense word / sentence / document representations by reconstructing the linguistic context of the words. In other words, this approach takes advantage of the fact that words with a common context are located close to each other in the vector space. Thus, the originally high dimensionality of the space can be reduced to several hundred features representing word embeddings. Tang et al. 7 employed long short-term memory (LSTM) and CNN to learn sentiment representation based on word embeddings and, subsequently, gated recurrent units (GRUs) were used to learn the document representation. They found that word embeddings combined in a CNN model provide the best sentiment classification performance, as compared with NB and SVM. Another CNN model combined word embeddings with user preferences extracted from the consumer reviews. 8 Similarly, Chen et al. 9 exploited product and user information in an LSTM classification model equipped with word and sentence attention. To address the limitations of the LSTM memory unit with long texts, Xu et al. 10 developed a cached LSTM model that captures the overall semantic representation. Vector representation models were also modified with respect to sentiment polarity to improve the performance of sentiment analysis models. 78 An intriguing area in the field is sentiment classification across domains, which Li et al. 36 addressed by an end-to-end adversarial memory network. To adaptively focus on aspect-related words, Tay et al. 37 developed an aspect fusion LSTM model that ameliorates the drawback of simple word-aspect similarities. Indeed, aspect-based sentiment analysis has become increasingly popular recently.
[38][39][40] Lexicon- and corpus-based sentiment scores were assigned to aspects identified by pre-defined lexicons. 41

Combinations of Word Representation Models
Another challenging task is combining different word representations. Context features, including word location, part-of-speech (POS) and sentiment score, can be appended to embedding representations in feature-based compositing memory networks; this work showed that ignoring words without sentiment is more effective than using document representations without context features. 42 Zhang et al. 43 proposed a cross-modality consistent regression model to take advantage of three different CNNs used to model semantic, sentiment and lexicon representations. Lexicon and sentiment representations reportedly address the disadvantages of semantic word embeddings in sentiment analysis. 43 However, the word embedding representations used in prior studies ignore the sentiment polarity / intensity of the words. Consequently, words with different sentiment polarity are combined in one feature, which may limit the classification performance of machine learning methods in sentiment analysis tasks. In other words, this may lead to the misrepresentation of documents in the context of sentiment analysis. Moreover, hybrid representation models combining word embeddings with different sentiment and semantic representations may further improve classification performance in related tasks due to highly domain-specific context. 44,45 Product and service reviews from different domains represent exactly such a task. Inspired by these observations, the original contribution of this study is the proposal of a DNN model integrating word embeddings with lexicon-based sentiment and emotion features. Notably, the proposed word-emotion associations enable us to obtain both the meaning and the sentiment polarity / intensity / emotions of the words in the online comment representation. In agreement with earlier research, 46 the proposed model considers different topics by extracting latent features from the word representation. Finally, the bag-of-words representation used utilizes a supervised term-weighting scheme.
The discriminative power of terms was also considered in previous studies, 46 but learning term weights during the neural network training process turned out to be prone to overfitting and highly time-consuming for high-dimensional data. 47 Therefore, the proposed model considers the discriminative power of terms already at the stage of the bag-of-words representation.

Neural Network Model
Fig. 1 depicts the architecture of the proposed DNN model with word-emotion associations, topic modeling and bag-of-words (BoW) component selected using supervised term weighting for the sentiment analysis of online comments. A DNN with convolutional, pooling and two dense hidden layers was used to capture high-level features from the hybrid document representation obtained from the word-emotion representation, topic modeling and BoW representation.

Word-Emotion Representation
The word-emotion representation is produced in two stages. In the first stage, the Skip-Gram model 31 is trained to obtain word embeddings. This model was used because it is reportedly more effective than its competitors in exploiting the word context. 31 Unlike the CBOW model, the target word is used as input while the context words represent the output layer in the Skip-Gram model. In the second stage, the vocabulary generated from the corpus of consumer reviews is compared with sentiment-based lexicons to identify various sentiment polarity and sentiment intensity features.

Table 1. Overview of related studies on the machine learning-based sentiment analysis of online comments.

Study | Features | Method | Dataset
Zhang (2015) 49 | characters | Temporal CNN | 4M Amazon product reviews
Du (2016) 20 | word-aspect CBOW | CNN | ∼44M Amazon product reviews
Chen (2016) 9 | Skip-Gram, user/product specific words | Hierarchical LSTM | 231K Yelp reviews
Poria (2016) 50 | CBOW, POS | CNN | 7.7K reviews from the SemEval 2014 dataset
Zhai (2016) 30 | BoW | Semi-supervised autoencoder | 62K Amazon product reviews
Tay (2018) 37 | GloVe | LSTM | 17K reviews from the SemEval 2014 and 2015 datasets
Rathor (2018) 57 | weighted unigrams | SVM | 24.5K Amazon product reviews
Asghar (2019) 41 | BoW, POS | Lexicon- and corpus-based SentiWordNet | 84K sentences from electronic product reviews
Gamal (2019) 58 | n-grams with tf.idf weights | PA, RR | 1K Amazon product reviews
Huang (2019) 59 | Word2Vec | CNN + RNN | ∼500K Amazon fine food reviews
Jagdale (2019) 60 | BoW, sentiment score | SVM, NB | 12K Amazon product reviews
Kausar (2019) 5 | BoW, POS, SentiWordNet sentiment score | RF, DT, NB, SVM, Gradient Boosting, LSTM | 31K Amazon product reviews
Ma (2019) 42 | Location, POS, NRC Hashtag sentiment score | FCMN | 8K reviews from the SemEval 2014 dataset
Riaz (2019) 3 | Sentiment strength, keyword extraction, tf.idf weights | k-means | 1.2M product reviews from Amazon, eBay and Alibaba
Mandhula (2020) 46 | keyword extraction using LDA | CNN + LSTM | ∼35M Amazon product reviews
Miao (2020) 4 | BERT | Semi-supervised learning | 71K reviews from the SemEval 2014 dataset
This study | Word-emotion associations, topic modeling using LDA, supervised tf.idf-based BoW | CNN | 400K Amazon product reviews, 515K hotel reviews

BERT - bidirectional encoder representations from transformers, BoW - bag-of-words, CNN - convolutional neural network, CBOW - continuous bag of words, CRF - conditional random field, DT - decision tree, FCMN - feature-based compositing memory network, LDA - latent Dirichlet allocation, LSTM - long short-term memory, NB - naïve Bayes, PA - passive aggressive, POS - part-of-speech tagging, RF - random forest, RR - ridge regression, RNN - recurrent neural network, and SVM - support vector machine.
To generate the embedding weight matrix, the embedding function is learnt and applied to each word w_t in the vocabulary. The embedding function is produced for the sequence of training words W = {w_1, w_2, . . . , w_t, . . . , w_T} so that the following objective function is maximized:

E = \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t),

where c denotes the context window radius (the number of surrounding words examined); and p(w_{t+j} | w_t) is the probability of the output context word given the input target word, calculated using the hierarchical softmax algorithm as follows:

p(w_O \mid w_I) = \prod_{j=1}^{L(w_O)-1} \sigma\left( [\![ n(w_O, j+1) = \mathrm{ch}(n(w_O, j)) ]\!] \cdot {\nu'}_{n(w_O, j)}^{\top} \nu_{w_I} \right),

where w_I and w_O represent the input and output words, respectively; ν_w and ν'_w are the vector representations of the input and output words, respectively; n(w, j) represents the j-th node on the path from the root to w in the binary tree; L(w) denotes the path length in the tree; ch(n) is a child node of n; σ(x) is the sigmoid function; and [[x]] = 1 if x is true and −1 otherwise. This can produce good embeddings by maximizing the objective E, i.e., similar words have similar vectors. Words are represented by leaf nodes in the binary tree, and the tree structure substantially reduces the complexity by decomposing the probability calculation into at most L(w) terms. To generate the word tree, the Huffman-based approach was used. 31 The hyper-parameters in the model were set as follows: learning rate = 0.025, window size = 5 and word vector dimensionality = {100, 200, 400}.
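The windowing mechanics behind the Skip-Gram objective can be sketched in a few lines: every word acts once as the input target, and each word within the context radius c becomes an output. The hierarchical-softmax training itself is omitted, and the token list and radius below are illustrative:

```python
def skipgram_pairs(tokens, c=5):
    """Generate (target, context) training pairs for the Skip-Gram model.

    The target word w_t is the input; each word within the context window
    radius c is an output, matching the double sum over t and j (j != 0)
    in the objective E above.
    """
    pairs = []
    for t, target in enumerate(tokens):
        for j in range(-c, c + 1):
            if j == 0:
                continue                      # skip the target itself
            if 0 <= t + j < len(tokens):      # stay inside the sentence
                pairs.append((target, tokens[t + j]))
    return pairs

tokens = ["this", "camera", "takes", "great", "photos"]
pairs = skipgram_pairs(tokens, c=2)
```

With a radius of 2, "takes" is paired with its four neighbours, while words more than two positions apart never form a pair, which is exactly the locality that makes the embeddings context-sensitive.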
To complement the word-emotion representation with the sentiment polarity and intensity of the words, several existing sentiment lexicons were used. To obtain a reliable lexicon-based emotion evaluation, it is best not to rely on a single lexicon. 61 In addition, the combination of various lexicon-based emotion indicators ensures wider lexical coverage and addresses the issue of susceptibility to indirect opinions, typically present in machine learning-based models. 61 To calculate sentiment polarity, two handcrafted lexicons of positive and negative words were used: Bing Liu's opinion lexicon 62 and OpinionFinder. 61 OpinionFinder is an annotated extended edition of the Multi-Perspective Question-Answering data. Bing Liu's opinion lexicon also includes slang and misspelled words, which makes this lexicon more unique than the OpinionFinder lexicon. 63 One disadvantage of these lexicons is that equal weights are assigned to all words irrespective of their sentiment intensity. To overcome this problem, sentiment intensity indicators from several pretrained lexicons 61,64 were incorporated: (1) SentiWordNet, (2) Sentiment140, (3) NRC Hashtag, and (4) AFINN. SentiWordNet extends the well-known WordNet database by annotating each synset with scores of positivity, neutrality and negativity in the range [0,1]. This annotation was performed automatically using a semi-supervised algorithm. Sentiment140 and NRC Hashtag are lexicons generated automatically from words with emotional tags. More precisely, the sentiment scores of Sentiment140 are based on positive and negative emoticons, while the NRC Hashtag lexicon uses positive and negative hashtags. For NRC Hashtag, sentiment scores ranging from -5 to 5 are obtained based on the point-wise mutual information between each word and the polarity of the corresponding message.
The NRC emotion-based lexicon, which covers eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) adopted from Plutchik's wheel of emotions, was also considered. 65 Again, these word lists are the result of human-based tagging. Finally, the list of emoticons was taken from the AFINN lexicon. 66 The list of the lexicon-based features we used is presented in Table 2, showing their source and description. A detailed description of the calculation of the sentiment polarity and intensity features is given in the original papers. 61,64
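A minimal sketch of how such lexicon-based features can be extracted per review is given below. The miniature lexicons are hypothetical stand-ins for the full resources named above (Bing Liu's opinion lexicon, an AFINN-style intensity lexicon, and the NRC emotion lexicon), which contain thousands of entries:

```python
# Hypothetical miniature lexicons standing in for the real resources.
POSITIVE = {"great", "excellent", "love"}
NEGATIVE = {"poor", "waste", "disappointed"}
AFINN = {"great": 3, "excellent": 3, "love": 3,
         "poor": -2, "waste": -2, "disappointed": -2}
NRC_EMOTION = {"love": {"joy", "trust"}, "waste": {"anger", "disgust"}}

PLUTCHIK = ("anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust")

def lexicon_features(tokens):
    """Count polarity hits, sum intensity scores and count each of the
    eight Plutchik emotions for one tokenized review."""
    feats = {
        "n_positive": sum(t in POSITIVE for t in tokens),
        "n_negative": sum(t in NEGATIVE for t in tokens),
        "intensity": sum(AFINN.get(t, 0) for t in tokens),
    }
    for emotion in PLUTCHIK:
        feats[emotion] = sum(emotion in NRC_EMOTION.get(t, ()) for t in tokens)
    return feats

feats = lexicon_features("i love this great product".split())
```

These per-review counts and scores are the kind of indicators that are concatenated with the averaged word embeddings in the word-emotion representation.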

Latent Dirichlet Allocation
LDA represents an enhanced generative topic model, in which documents are multinomial distributions over latent topics (mixtures of words). 67 Topics, in turn, are represented by word distributions. Each document is generated by a two-step probabilistic process. First, the word probability distribution Φ_k is sampled for the k-th topic from the Dirichlet distribution Dir(β) with the topic parameter β. Second, the topic probability distribution θ_j is sampled for the j-th document from the Dirichlet distribution Dir(α), and each latent topic variable z_n follows the multinomial distribution θ. Given the parameters α and β, which determine the Dirichlet priors on θ and Φ, the joint probability distribution is as follows:

p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta),

where word w_n is generated from the multinomial distribution p(w_n | z_n, β), and the model parameters p(θ|α), p(z_n|θ) and p(w_n|z_n, β) can be estimated by optimization algorithms. In this study, the collapsed variational Bayes approximation (with iteration limit = 100, data pass limit = 1, mini-batch size = 1,000, and learning rate decay = 0.5) was used due to its faster convergence rate compared with collapsed Gibbs sampling. 68 In agreement with previous studies, 69 only verbs and nouns were used for topic modeling. The use of these features is justified because most aspect terms are nouns or noun chunks. 50 The Stanford Tagger was employed for POS tagging. In addition to the topic probabilities identified using LDA, we followed Poria et al. 50 and used six POS tags (noun, verb, adverb, conjunction, adjective, and preposition), calculated as the absolute frequencies of the terms selected using the supervised term weighting scheme.
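The two-step generative process can be simulated directly (a toy NumPy sketch; the corpus sizes and the symmetric prior values are illustrative, and real runs would use the collapsed variational Bayes inference named above rather than forward sampling):

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, N = 3, 8, 20          # topics, vocabulary size, words per document
alpha, beta = 0.5, 0.5      # symmetric Dirichlet priors (illustrative values)

# Step 1: sample the per-topic word distribution Phi_k ~ Dir(beta).
phi = rng.dirichlet(np.full(V, beta), size=K)     # shape (K, V), rows sum to 1

# Step 2: for one document, sample theta ~ Dir(alpha), then for each word
# draw a topic z_n ~ Multinomial(theta) and a word w_n ~ Multinomial(phi[z_n]).
theta = rng.dirichlet(np.full(K, alpha))          # shape (K,)
z = rng.choice(K, size=N, p=theta)                # latent topic per word
w = np.array([rng.choice(V, p=phi[zn]) for zn in z])  # observed word ids
```

Inference inverts this process: given only the observed word ids w across many documents, it recovers estimates of θ and Φ.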

Supervised Term Weighting Scheme for Sentiment Analysis
Let D_1 and D_2 be the sets of documents of the positive and negative opinion classes, respectively. The j-th document d_j is represented by a vector of term weights d_j = (w_j1, w_j2, . . . , w_jm), defined for terms f_1, f_2, . . . , f_m. In the supervised weighting scheme used, w_ij is calculated as follows:

w_{ij} = tf_{ij} \times ITS(f_i),

where tf_ij is the frequency of term f_i in document d_j, and the importance of the term for sentiment analysis, ITS(f_i), is based on the weighted frequency and odds (WFO):

WFO_k(f_i) = \left( \frac{x_i^k}{N_k} \right)^{\lambda} \left( \log \frac{x_i^k / N_k}{y_i^k / (N - N_k)} \right)^{1-\lambda}, \qquad ITS(f_i) = \max_{k \in \{1,2\}} WFO_k(f_i),

where x_i^k is the number of documents from D_k that contain the term f_i, y_i^k is the number of documents that do not belong to D_k that contain the term f_i, λ is the ratio between frequency and odds, N_k is the number of documents in D_k, and N is the total number of documents. Following extensive experiments performed by Deng et al. 47 on multiple sentiment analysis datasets, the value of the hyper-parameter λ was set to 0.1 in this study to ensure stability between frequency and odds. To obtain ITS(f_i), the maximum of WFO for the positive and negative sentiment classes was calculated. For further processing, the terms f_i were ranked according to their weights, and the top n = 1,000 terms 71 were selected to enter the document representation layer.
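The weighting scheme can be sketched as follows. This follows the WFO/ITS definitions above; the smoothing constant and the clamping of negative log-odds at zero are implementation assumptions, the latter needed to keep the fractional power real-valued:

```python
import math

def wfo(x, y, n_k, n_rest, lam=0.1, eps=1e-9):
    """Weighted Frequency and Odds for one class.

    x: documents of the class containing the term (x_i^k)
    y: documents outside the class containing the term (y_i^k)
    lam: balance between frequency and odds (0.1 as in the study).
    The log-odds factor is clamped at zero (an assumption here) so the
    fractional power (1 - lam) stays real-valued.
    """
    freq = x / n_k
    odds = max(0.0, math.log((x / n_k + eps) / (y / n_rest + eps)))
    return (freq ** lam) * (odds ** (1 - lam))

def its(x_pos, x_neg, n_pos, n_neg, lam=0.1):
    # ITS(f_i) = maximum of WFO over the two sentiment classes.
    return max(wfo(x_pos, x_neg, n_pos, n_neg, lam),
               wfo(x_neg, x_pos, n_neg, n_pos, lam))

# Hypothetical counts: "great" occurs in 80 of 100 positive and 5 of 100
# negative documents; "the" occurs almost everywhere in both classes.
w_great = its(80, 5, 100, 100)
w_the = its(95, 96, 100, 100)
```

As expected, the discriminative term receives a far larger weight than the ubiquitous one, so only sentiment-bearing terms survive the top-n cut.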

Training the Neural Network Model
The DNN model comprises one convolutional layer with 50 feature maps (filters) of filter size 3, followed by a max-pooling layer of size 2. The maximum number of words in the reviews was used to fix the size of the inputs. The next two hidden layers in the DNN architecture process the complex relationship between the document representation and the outputted positive / negative sentiment class of the online comment. To avoid overfitting and make the training more effective, dropout regularization was applied with dropout rates of 0.2 and 0.5 for the input and hidden layers, respectively. Rectified linear units (ReLU) were used as the activation functions in the convolutional and dense hidden layers. Training the DNN using the mini-batch gradient descent algorithm with b = 100 mini-batches, a learning rate of 0.1 and I = 1,000 iterations provided us with stable convergence and computationally efficient behavior. Cross-entropy loss was used as the objective function. Different numbers of filters in the convolutional layer = {20, 50, 100} were tested, as were the numbers n_h1 and n_h2 of ReLUs in the two dense layers = {2^4, 2^5, . . . , 2^9}, to obtain the optimal DNN architecture. Experiments with two convolutional and one / three dense layers were also performed, but without improvement. The results for these architectures are not presented in this study due to space limitations. The computational complexity of the proposed DNN model can be expressed as O(b × I × (k × n × d^2 + m × n_h1 + n_h1 × n_h2 + n_h2 × n_O)), where k is the length of the filter in the convolutional layer, n is the sequence length in the convolutional layer, d is the word vector dimensionality, m is the number of features in the document representation layer and n_h1, n_h2 and n_O denote the numbers of neurons in the dense and output layers, respectively. This implies that the number of iterations, the word vector dimensionality and the number of terms in the bag-of-words model are the most important determinants of the computational complexity of the proposed model.
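A single forward pass through this architecture can be sketched with NumPy (random, untrained weights; dropout and mini-batch training are omitted, and the dense layer sizes are illustrative values from the tested range):

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda x: np.maximum(x, 0.0)

def forward(x, n_filters=50, k=3, n_h1=256, n_h2=128):
    """One forward pass: convolution (50 filters of size 3), max pooling
    of size 2, two dense ReLU layers and a sigmoid output neuron."""
    n, d = x.shape                                  # words x embedding dims
    w_conv = rng.standard_normal((n_filters, k, d)) * 0.01
    # Slide each filter over all windows of k consecutive word vectors.
    conv = np.stack([
        relu(np.array([(x[i:i + k] * w_conv[f]).sum()
                       for i in range(n - k + 1)]))
        for f in range(n_filters)
    ])                                              # (n_filters, n - k + 1)
    # Max pooling of size 2 (truncate an odd tail if present).
    pooled = conv[:, :(conv.shape[1] // 2) * 2].reshape(n_filters, -1, 2).max(axis=2)
    h = pooled.reshape(-1)                          # flattened feature vector
    w1 = rng.standard_normal((h.size, n_h1)) * 0.01
    w2 = rng.standard_normal((n_h1, n_h2)) * 0.01
    w3 = rng.standard_normal((n_h2, 1)) * 0.01
    h1 = relu(h @ w1)                               # first dense ReLU layer
    h2 = relu(h1 @ w2)                              # second dense ReLU layer
    return 1.0 / (1.0 + np.exp(-(h2 @ w3)))        # positive-class probability

p = forward(rng.standard_normal((20, 100)))        # a 20-word review, d = 100
```

In training, the cross-entropy between this output probability and the true label would drive the mini-batch gradient descent updates described above.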

Data and Preprocessing
For the experiments, two large datasets were used, namely the Amazon and Hotel review datasets openly accessible on Kaggle ab . The Amazon dataset was provided by Xiang Zhang and originally used to classify the sentiment of consumer reviews using temporal CNNs with character-level features. 49 The dataset has been gradually expanded within the Stanford Network Analysis Project since 1994, 72 currently comprising ∼34 million reviews from ∼6.6 million users on ∼2.4 million products. The mean character length of the consumer reviews in the dataset was 764 (90.9 words). Extremely short and long consumer reviews were discarded, and duplicates were removed by Xiang Zhang. Users' rating scores serve to categorize the consumer reviews into positive and negative sentiment orientation. More precisely, scores of 1 and 2 indicate negative sentiment, whereas scores of 4 and 5 indicate positive sentiment. To evaluate the effectiveness of the proposed DNN model, the testing data from the original dataset was used in this study, represented by 400,000 consumer reviews evenly distributed into positive and negative sentiment classes. The text of the reviews was represented by the review title and review content. Regarding the Hotel review dataset, of 515,738 customer reviews in total, 30,703 were negative (with overall ratings < 5) and 485,035 were positive (with overall ratings ≥ 5). In other words, the Hotel review dataset was imbalanced 15.8 to 1 in favour of positive reviews. It is worth noting that experiments were performed with random undersampling and oversampling to address the data imbalance problem, but without improvement in accuracy. The mean number of words for this dataset was 35.6. In the text pre-processing stage, we carried out tokenization (using the following delimiters: ".,;:'"()?!") and transformation to lowercase letters. A prefix was also added to words occurring in negated contexts in the case of the bag-of-words model.
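The pre-processing steps can be sketched as follows; the negation word list and the "NEG_" prefix string are assumptions, as the study does not specify them:

```python
import re

DELIMS = ".,;:'\"()?!"                       # delimiter set used in the study
NEGATIONS = {"not", "no", "never", "nor"}    # hypothetical negation words

def preprocess(text):
    """Lowercase, split on the delimiter set, and prefix words that follow
    a negation word (up to the next clause boundary) with "NEG_"."""
    tokens, negated = [], False
    # Keep delimiters as separate matches so they can reset negation scope.
    pattern = r"[^\s" + re.escape(DELIMS) + r"]+|[" + re.escape(DELIMS) + r"]"
    for tok in re.findall(pattern, text.lower()):
        if tok in DELIMS:
            negated = False                  # clause boundary ends the scope
            continue
        tokens.append("NEG_" + tok if negated else tok)
        if tok in NEGATIONS:
            negated = True
    return tokens

tokens = preprocess("I did not like this hotel, great location though.")
```

Marking negated words as distinct tokens lets the bag-of-words model distinguish "like" from "NEG_like", which would otherwise fall into the same feature.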

Experimental Results
First, the effectiveness of each component of the proposed document representation model was investigated. Two evaluation measures were considered, namely accuracy (Acc = (true positives + true negatives) / (true positives + true negatives + false positives + false negatives)) and the area under the receiver operating characteristic curve (AUC). To evaluate classification performance, the datasets were divided into training and testing sets containing 80% and 20% of the data instances, respectively. This data split has proved effective for deep learning methods in sentiment analysis. 73 A stratified split was applied to maintain the sentiment class prevalence between data splits. To ensure reliable and consistent results, this procedure was repeated ten times; the mean values and standard deviations are reported for the testing set.
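The repeated stratified 80/20 protocol can be sketched with scikit-learn; the toy labels below mimic the hotel dataset's class imbalance:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy labels with a roughly 15.8:1 imbalance, as in the hotel dataset.
y = np.array([1] * 474 + [0] * 30)
X = np.arange(len(y)).reshape(-1, 1)   # placeholder features

# Ten repetitions of a stratified 80/20 split; each test fold preserves
# the class prevalence of the full dataset.
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
test_ratios = [y[test].mean() for _, test in splitter.split(X, y)]
```

Stratification matters precisely for imbalanced data: a plain random 20% sample of these labels could easily contain almost no minority-class reviews, making the fold-level AUC estimates unstable.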
To obtain word embeddings, the Skip-Gram model was trained on the original Amazon product review dataset with ∼34 million reviews for the Amazon dataset, while the Hotel review dataset was used to produce word embeddings for the hospitality domain. Fig. 2 illustrates the different settings of the Skip-Gram model that were examined. The best performance was achieved with word vector dimensionalities of 200 and 400 for the Amazon dataset and Hotel dataset, respectively. We trained the Skip-Gram model in the Deeplearning4j environment (a distributed, open-source DNN library written for Java, compatible with Clojure and Scala, and integrated with the distributed computing frameworks Hadoop and Apache Spark).

a https://www.kaggle.com/bittlingmayer/amazonreviews
b https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe

As shown in Fig. 3, the positive (negative) sentiment polarity / intensity indicators of the consumer reviews in the positive sentiment class have higher (lower) mean values than those in the negative class. Fig. 4 illustrates that reviews in the positive class are characterized by higher values of emotions with positive engagement (joy, anticipation, trust and surprise), whereas the negative class is distinguished by emotions with negative engagement, such as sadness, fear, disgust and anger. The mean values of the emoticon positive and negative scores for the positive class were 0.027 and -0.013, respectively. In contrast, the corresponding values for the negative class were only 0.006 and -0.021. Overall, these results indicate the valuable role of sentiment- and emotion-based indicators in the sentiment analysis of consumer reviews. Similar results were observed for the Hotel reviews. To calculate the values of the sentiment polarity / intensity and emotion-based indicators, we used the AffectiveTweets package. The LDA model was trained using the collapsed variational Bayes approximation, implemented in the Text Analytics Toolbox of Matlab 2019b. The maximum number of iterations was set to 1,000.
To select the appropriate number of topics in LDA, different numbers of topics were examined in the range {5, 10, . . . , 60}. A tuning procedure was employed to minimize the LDA model's perplexity on 10% of held-out data. Fig. 5 shows that the minimum validation perplexity was achieved with thirty topics for the Amazon dataset and five topics for the Hotel dataset; therefore, the number of topics was set to K = 30 and K = 5, respectively. For the latter, the generated word clouds indicated that the five topics represented hotel food, staff, location, hotel services, and room quality.
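The perplexity-based selection of the number of topics can be sketched as follows (scikit-learn's online variational Bayes LDA is used here instead of the collapsed variational Bayes implementation from the study, and the data are toy counts):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# Toy document-term count matrices standing in for the review corpora.
X_train = rng.integers(0, 5, size=(100, 30))
X_val = rng.integers(0, 5, size=(20, 30))      # held-out data for validation

# Fit one LDA model per candidate K and keep the K with the lowest
# held-out perplexity, mirroring the tuning over {5, 10, ..., 60}.
candidates = [2, 5, 10]
perplexities = {k: LatentDirichletAllocation(n_components=k, random_state=0)
                   .fit(X_train).perplexity(X_val)
                for k in candidates}
best_k = min(perplexities, key=perplexities.get)
```

Held-out perplexity penalizes both underfitting (too few topics) and overfitting (too many), which is why the curve in Fig. 5 has an interior minimum.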
Regarding the discriminative power of terms, terms with strong sentiment engagement were selected, ranked for the Amazon dataset as follows: "great," "waste," "money," "love," "worst," "poor," "excellent," "bad," "disappointed," etc. This suggests that such a weighting scheme is appropriate for the sentiment analysis of consumer reviews. Fig. 6 illustrates the effectiveness of the supervised term weighting scheme for the bag-of-words (BoW) representation. Traditional tf.idf (term frequency - inverse document frequency) weights for the top 1,000 n-grams (unigrams, bigrams and trigrams) were used for comparison. 75 The above results indicate the separate effectiveness of the three document representation components. In a further set of experiments, the synergistic effects of combining these components into an integrated model were investigated.
The quality of the proposed models was evaluated using the Acc and AUC evaluation measures. Since the examined variables had a normal distribution (Kolmogorov-Smirnov test: N = 10, max D < 0.324 (0.381), p > 0.05), parametric tests for repeated measures were used. The Mauchly sphericity test was used to verify the sphericity assumption for repeated measures with five levels, the fifth being DNN-unadj.BoW+TM+WE, i.e., DNN-BoW+TM+WE with unadjusted BoW term weights. For both datasets, the test was significant (Acc: p = 0.0127; AUC: p = 0.1059 for the Amazon dataset, and Acc: p = 0.3067; AUC: p = 0.00005 for the Hotel dataset). The sphericity assumption was considered violated, which inflates the type I error rate. The degrees of freedom of the F-test were therefore adjusted using the Greenhouse-Geisser and Huynh-Feldt corrections to maintain the declared level of significance. The results showed that the null hypotheses, stating that there is no statistically significant difference in the values of the evaluation measures between the investigated models, were rejected at the 0.001 significance level. After rejecting the global null hypotheses, statistically significant differences in performance were tested between pairs of models. For multiple comparisons, the Newman-Keuls test was used, which has more power than common post-hoc tests. In the multiple comparisons based on Acc for the Hotel dataset, only one homogeneous group was identified: DNN-WE and DNN-unadj.BoW+TM+WE performed the same (p > 0.05). In all other cases, statistically significant differences in performance between the investigated models were identified for both evaluation measures (p < 0.05). The DNN-BoW+TM+WE models with unadjusted as well as adjusted tf.idf achieved high quality. Fig. 7 and Fig. 8 show that the DNN model using only the topic modeling component had the worst performance. More precisely, the DNNs with word-emotion representation and supervised term weights increased accuracy by 17.4% and 1.1%, respectively, compared with DNN-TM.
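The Greenhouse-Geisser adjustment mentioned above scales the F-test's degrees of freedom by an epsilon estimated from the covariance of the repeated measures; a minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def greenhouse_geisser_epsilon(data):
    """data: (n_subjects, k_conditions) array of repeated measures.
    Returns the GG epsilon in (1/(k-1), 1]; epsilon = 1 means sphericity
    holds, smaller values indicate a stronger violation."""
    S = np.cov(data, rowvar=False)   # k x k covariance across conditions
    k = S.shape[0]
    # double-center the covariance matrix
    Sc = S - S.mean(axis=0, keepdims=True) - S.mean(axis=1, keepdims=True) + S.mean()
    return np.trace(Sc) ** 2 / ((k - 1) * np.sum(Sc ** 2))
```

The adjusted F-test then uses epsilon · (k − 1) and epsilon · (k − 1)(n − 1) degrees of freedom in place of the unadjusted values.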
DNN-WE and DNN-BoW performed similarly in terms of both evaluation measures. The DNN-BoW+TM+WE model performed best, with a 5.1% and 0.5% increase in accuracy compared with the DNN-BoW model for the Amazon dataset and Hotel review dataset, respectively. Overall, strong evidence of the effectiveness of the combination of the three components was found. Further statistical tests, both parametric and nonparametric (Friedman ANOVA and multiple comparisons based upon the mean rank differences), revealed that DNN-BoW+TM+WE performed significantly better than the baseline models at p < 0.01. The results of the parametric and nonparametric approaches agree and can therefore be considered robust.

To comprehensively evaluate the effectiveness of the proposed DNN models, their performance was compared against the following existing models, used in earlier studies on the sentiment analysis of consumer reviews:

• Improved NB (INB-1) 48 accommodates word sentiment using the SentiWordNet lexicon in the feature extraction process. Like Kang, 48 unigrams, bigrams and sentiment patterns were extracted.

• LSTM 7 and CNN 7 were used to capture semantic sentence-level representation. In agreement with Chen et al., 9 the dimension of hidden / cell states in LSTM was set to 200, corresponding to the number of word embeddings. The CNN model comprised a convolutional layer with five filters of size 5 and a max pooling layer of size 4. For both models, the sentence representation was fixed using the number of words in the longest review. The document representation for both models was generated as a composition of sentence representations using GRUs. Stochastic gradient descent with the Adam optimizer was used to train both models in the Deeplearning4j environment.
• Aspect-specific sentiment word embedding (AS-SWE) 20 is based on the CBOW model generated for each word-aspect pair. LDA was trained with the collapsed Gibbs sampling algorithm to assign aspects to each term. The remaining training parameters of the CNN model were the same as in the previous comparative model.

• The CNN+LP (linguistic pattern) model 50 is also based on the pretrained CBOW model. In addition, six basic POS tags were used as input features. Again, the CNN+LP model was trained using the Deeplearning4j environment.

• The ensemble classifier model NB + SVM + Bagging combines three baseline classifiers, namely NB, SVM and bagging. 56 Following the original study, unigrams were used as input features and voting was employed as the meta-classifier to obtain the final review classification.

• The aspect-based NB (ANB) model 55 uses three types of input features, namely POS tags (obtained using the Stanford CoreNLP library) and two bags-of-words containing, respectively, aspect- and sentiment polarity-related words. The Chi-square feature selection algorithm was used to reduce the dimensionality of the word representation, and the NB classifier was employed to classify the product reviews into the sentiment categories.

• The ridge regression (RR) classifier uses the top 1,000 n-grams according to their tf.idf weights. 58 The RR model was selected because it performed best for the Amazon product dataset in the original study, as compared with different machine learning algorithms, such as SVM, NB, AdaBoost and logistic regression.

• The NB classifier uses sentiment polarity scores at the sentence level (NB-SPS). 60 The SentiWordNet lexicon was used to calculate the positive and negative polarity of each sentence.

• An SVM with word sense disambiguation (SVM-WSD) 5 utilizes adverbs scored using the SentiWordNet lexicon as input features. Adverbs were assigned positive and negative SentiWordNet scores, and the SVM was trained using the LibLINEAR library.
The L2-regularized L2-loss SVM model was trained with the cost parameter C = 1.

Table 3 shows the results of DNN-BoW+TM+WE in comparison with the above sentiment analysis models. Remarkably, the proposed model not only performs best in terms of all the evaluation measures used but also performs significantly better at p < 0.01 according to nonparametric tests (Friedman ANOVA and multiple comparisons based upon the mean rank differences), which emphasizes the validity of the proposed model and the robustness of the achieved results. SVM-WSD also performed well in terms of accuracy, especially considering its low computational time.
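The LibLINEAR configuration above (L2 regularization, squared-hinge loss, C = 1) is also exposed by scikit-learn's `LinearSVC`, which wraps LibLINEAR; the toy reviews below are illustrative, not the study's data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# L2 penalty + squared-hinge (L2) loss with C = 1, LibLINEAR's primal problem
svm = make_pipeline(
    TfidfVectorizer(),
    LinearSVC(penalty="l2", loss="squared_hinge", C=1.0),
)
reviews = ["terrible quality, a waste of money", "excellent product, love it",
           "poor and disappointing", "great value, works perfectly"]
labels = [0, 1, 0, 1]
svm.fit(reviews, labels)
```

After fitting, `svm.predict` labels unseen reviews as positive (1) or negative (0).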
Following previous studies, 45 the testing time criterion (measured as wall-clock time per 1,000 reviews) was adopted to demonstrate the real-time capability of consumer review classifiers. The proposed DNN-BoW+TM+WE model performed the worst in terms of time efficiency, but it can still be considered time efficient, classifying approximately 2,300 consumer reviews per second. The average training time of the proposed DNN model was approximately 1,650 s and 2,000 s for the Amazon product review dataset and Hotel review dataset, respectively. Recall that the crucial determinant of the computational complexity is the word vector dimensionality, which leads to higher complexity for the Hotel dataset. Moreover, better time efficiency can be expected with a decrease in the number of n-grams. Overall, the model performed well for both sentiment classes, as indicated by the high value of AUC.
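The wall-clock throughput criterion can be measured with a simple harness of the following kind (the `classify` stub is a placeholder, not the study's model):

```python
import time

def reviews_per_second(classify, reviews):
    """Wall-clock classification throughput, in reviews classified per second."""
    start = time.perf_counter()
    for review in reviews:
        classify(review)
    elapsed = time.perf_counter() - start
    return len(reviews) / elapsed

# placeholder classifier: positive iff the review mentions "great"
rate = reviews_per_second(lambda r: int("great" in r), ["great phone", "bad battery"] * 500)
```

Reporting the reciprocal (seconds per 1,000 reviews) gives the testing time criterion used above.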
To verify the effectiveness of the proposed models, adjusted tests for repeated measures were used. Epsilon represents the degree to which sphericity has been violated. When comparing the proposed models to the existing ones, the Epsilon values were considerably less than one. The null hypothesis, that there is no statistically significant difference in the values of Acc and AUC among the investigated models, was thus rejected at p < 0.001.
For multiple comparisons, one-sided tests were used to examine the effectiveness of the individual proposed models against the existing models, i.e., many-to-one comparisons (existing models against the proposed DNN model). The Dunnett test was used, which tests the null hypothesis that there is no statistically significant difference in efficiency (model performance) between the proposed model and the existing models.
For the DNN-WE model, the null hypothesis was rejected for the existing ANB, 55 INB-1, 48 NB-SPS 60 and SVM-WSD 5 models, based on both evaluation measures at p < 0.001, i.e., the DNN-WE model was more efficient than the existing ANB 55 and INB-1 48 models for both datasets. This supports the dominance of word embedding models over bag-of-words models reported in earlier studies. 26,27 Similarly, DNN-BoW was more efficient than the existing ANB, 55 INB-1, 48 NB-SPS, 60 SVM-WSD 5 and NB+SVM+Bagging 56 models, based on both evaluation measures (p < 0.001). This can be attributed to the more effective feature selection in the bag-of-words model. Note that this improvement was observed mainly for the Amazon dataset, which has a sufficient number of instances in both sentiment classes. Based on the evaluation measures, the DNN-BoW+TM+WE models with the unadjusted as well as adjusted tf.idf had the highest quality compared to the existing models. The null hypothesis was rejected for all existing models at p < 0.001 for the Amazon dataset. For the Hotel dataset, the DNN-BoW+TM+WE models significantly outperformed most of the existing models, except CNN, 7 AS-SWE, 20 CNN+LP 50 and LSTM. 7 This can be explained by the more effective learning of word embeddings in the case of the generally shorter hotel reviews.
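The nonparametric Friedman ANOVA used as a cross-check in the comparisons above can be run with `scipy.stats.friedmanchisquare`; the per-fold score matrix below is toy data, not the study's results:

```python
from scipy.stats import friedmanchisquare

# per-fold Acc of three models over the same 10 folds (toy numbers)
model_a = [0.90, 0.91, 0.89, 0.92, 0.90, 0.91, 0.90, 0.89, 0.92, 0.91]
model_b = [0.86, 0.87, 0.85, 0.88, 0.86, 0.87, 0.86, 0.85, 0.88, 0.87]
model_c = [0.88, 0.89, 0.87, 0.90, 0.88, 0.89, 0.88, 0.87, 0.90, 0.89]

# Friedman test ranks models within each fold, then tests for equal mean ranks
stat, p = friedmanchisquare(model_a, model_b, model_c)
```

A small p-value rejects the global null hypothesis of equal performance, after which rank-based multiple comparisons can identify which pairs differ.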

Conclusion
This study proposes an efficient DNN model integrating word-emotion associations, topic modeling and supervised term weighting for the sentiment analysis of online comments. The DNN model was shown to perform better than baseline document representations for the Amazon product review and hotel review datasets, irrespective of the difference in their class imbalance ratio. The average sentiment classification accuracy of the proposed model was 91.0% and 95.1% for the Amazon and Hotel datasets, respectively. The improvement over the baseline document representations was achieved through the integrated representation; compared with the baseline representations, the proposed model increased Acc by 4.3% and 0.3% on average, respectively.
The proposed DNN-WEAE model was compared with ten state-of-the-art sentiment analysis methods combining sentiment analysis and topic modeling in different ways. In contrast to those approaches, this study considered various sentiment polarity / intensity and emotion indicators in the word-emotion representation. In addition, the proposed model utilized a supervised term weighting scheme to improve BoW selection. The combination of these components performed best, indicating that the improved performance stems from combining a low-dimensional dense representation of word embeddings with a high-dimensional sparse BoW representation of high discriminative power. However, such a document representation model leads to a partly sparse dataset, which imposes further requirements on the sentiment classification methods. It was demonstrated that the proposed DNN model can handle such a document representation. The average AUC performance of the existing CNN and LSTM architectures was improved by 3% using the proposed DNN model for the Amazon dataset, while no improvement was obtained for the Hotel dataset. This can be attributed to the reduced effect of the supervised term weighting scheme in the presence of a limited number of reviews in one of the sentiment classes.
Future research should investigate word-emotion associations directly at the entity / aspect level, rather than separately. A limitation of the proposed model is that it captures only local features. Therefore, future studies should investigate alternative DNN models with attention mechanisms. More research is also needed to better understand the cross-domain modifications of the model. To improve the understanding of sentences, recently developed pattern-based methods can be used. 77 Alternative embedding-based schemes, such as GloVe, fastText, Sentence-BERT, the Universal Sentence Encoder and Word Mover's Embedding, can serve to generate the word-emotion associations. The proposed model should also be applied to multi-class sentiment analysis, and new powerful supervised machine learning methods should be employed to automate the design of neural network models, such as neural dynamic classification 78 and dynamic ensembles of neural networks. 79 Finally, the time efficiency of the model can be improved using specialized TPU accelerators.