Publication: The Effect of Text Preprocessing Strategies on Detecting Fake Consumer Reviews
Conference object · Restricted access · peer-reviewed · postprint
Date
Authors
Barushka, Aliaksandr
Hájek, Petr
Journal title
Journal ISSN
Volume title
Publisher
ACM (Association for Computing Machinery)
Abstract
Fake review detection is becoming crucial due to the rapid growth of internet purchases. It is important to choose the most efficient algorithm for detecting fake (deceptive, spam) reviews, whether positive or negative. It is equally important, however, to preprocess the textual content of the reviews, both for training and later in the production environment. This study examines a number of text preprocessing choices, including feature dimensionality, tokenization, removal of stop words, stemming and different term weighting schemes. Three well-known machine learning algorithms are used as benchmark classifiers: Naïve Bayes, neural network and support vector machine. Here we show that text preprocessing strategies are important determinants of the classifiers' performance. We find that the classifiers perform better for high-dimensional datasets represented by bigrams or trigrams selected according to the non-binary weighting scheme. Stemming and stop word removal seem to be less important.
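The preprocessing choices named in the abstract (tokenization, stop word removal, n-gram features and binary versus non-binary term weighting) can be illustrated with a minimal sketch. This is not the authors' implementation: the stop word list, the example reviews, the function names and the use of raw term frequency and TF-IDF as the non-binary schemes are all assumptions made for illustration, and stemming and the classifiers themselves are left out.

```python
import math
from collections import Counter

# Tiny illustrative stop word list (an assumption, not the list used in the study).
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "in", "it"}


def tokenize(text, remove_stop_words=False):
    """Lowercase, split on whitespace and strip surrounding punctuation."""
    tokens = [t.strip(".,!?;:\"'()") for t in text.lower().split()]
    tokens = [t for t in tokens if t]
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def ngrams(tokens, n_max):
    """Return all n-grams of length 1..n_max as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def vectorize(docs, n_max=2, weighting="tfidf", remove_stop_words=False):
    """Map each document to a {term: weight} dict under the chosen scheme."""
    counts = [Counter(ngrams(tokenize(d, remove_stop_words), n_max)) for d in docs]
    if weighting == "binary":
        # Presence/absence only.
        return [{t: 1.0 for t in c} for c in counts]
    if weighting == "tf":
        # Raw term frequency.
        return [{t: float(f) for t, f in c.items()} for c in counts]
    if weighting == "tfidf":
        # Term frequency scaled by log inverse document frequency.
        df = Counter(t for c in counts for t in c)
        n_docs = len(docs)
        return [
            {t: f * math.log(n_docs / df[t]) for t, f in c.items()}
            for c in counts
        ]
    raise ValueError(f"unknown weighting: {weighting}")


# Hypothetical example reviews, invented for this sketch.
reviews = [
    "The room was great, great staff and a great location!",
    "The room was dirty and the staff was rude.",
]
binary = vectorize(reviews, n_max=2, weighting="binary")
tf = vectorize(reviews, n_max=2, weighting="tf")
tfidf = vectorize(reviews, n_max=2, weighting="tfidf")
```

Under the binary scheme the repeated word "great" carries the same weight as any other term, while the two non-binary schemes keep its frequency. Raising `n_max` from 1 to 2 or 3 adds bigram and trigram features and so increases the dimensionality of the representation.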
Description
Keywords
fake, reviews, text preprocessing, bag of words, machine learning