The Effect of Text Preprocessing Strategies on Detecting Fake Consumer Reviews

Barushka, Aliaksandr; Hájek, Petr

DOI: 10.1145/3383902.3383908

URI: https://hdl.handle.net/10195/77006

Publication date: 2019

Document type: ConferenceObject

Source document: ICEBI 2019 : proceedings of the 2019 3rd International Conference on E-Business and Internet

Publisher's version: https://dl.acm.org/doi/abs/10.1145/3383902.3383908#sec-terms

Event name: 3rd International Conference on E-Business and Internet, ICEBI 2019 (09.11.2019 - 11.11.2019, Prague)

Abstract:

Fake review detection is becoming crucial due to the rapid growth of internet purchases. Choosing an efficient algorithm to detect fake (deceptive, spam) reviews, whether positive or negative, is important. It is equally important to preprocess the textual content of the reviews, both for training and later for the production environment. This study examines a number of text preprocessing choices: feature dimensionality, tokenization, removal of stop words, stemming, and different term weighting schemes. Three well-known machine learning algorithms are used as benchmark classifiers: Naïve Bayes, a neural network, and a support vector machine. Here we show that text preprocessing strategies are important determinants of the classifiers' performance. We find that the classifiers perform better for high-dimensional datasets represented by bigrams or trigrams selected according to a non-binary weighting scheme. Stemming and stop word removal seem to be less important.
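
The preprocessing choices named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list, the function names (`tokenize`, `ngrams`, `vectorize`), the example reviews, and the particular TF-IDF formula are all assumptions chosen for illustration, and stemming is left out. The sketch contrasts a binary weighting scheme with one possible non-binary scheme over bigram features.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the paper).
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "it"}


def tokenize(text, remove_stop_words=False):
    """Lowercase the text and split it into word tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def ngrams(tokens, n):
    """Return the n-grams of a token sequence, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vectorize(docs, n=2, weighting="tfidf", remove_stop_words=False):
    """Map each document to a {term: weight} dictionary.

    weighting="binary": 1.0 for every n-gram present in the document.
    weighting="tfidf":  raw term frequency times log(N / document frequency),
                        one common non-binary scheme (assumed here).
    """
    counts = [Counter(ngrams(tokenize(d, remove_stop_words), n)) for d in docs]
    # Iterating over a Counter yields each distinct term once per document.
    doc_freq = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        if weighting == "binary":
            vectors.append({term: 1.0 for term in c})
        else:
            vectors.append(
                {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in c.items()}
            )
    return vectors


# Hypothetical example reviews, not taken from the study's datasets.
reviews = [
    "The room was great and the staff was great",
    "The room was dirty",
]
binary_vectors = vectorize(reviews, n=2, weighting="binary")
tfidf_vectors = vectorize(reviews, n=2, weighting="tfidf")
```

Under the binary scheme the repeated bigram "was great" gets the same weight as any other bigram in the first review, while the non-binary scheme weights it by how often it occurs and how rare it is across reviews. The resulting dictionaries could be passed to any of the benchmark classifiers after conversion to a fixed-length feature vector.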
