Publication:
The Effect of Text Preprocessing Strategies on Detecting Fake Consumer Reviews

Conference object, restricted access, peer-reviewed, postprint

Authors

Barushka, Aliaksandr
Hájek, Petr

Publisher

ACM (Association for Computing Machinery)

Abstract

Fake review detection is becoming crucial due to the rapid growth of online purchases. It is important to choose the most efficient algorithm for detecting fake (deceptive, spam) reviews, whether positive or negative. It is equally important to preprocess the textual content of the reviews, both for training and later in a production environment. This study examines a number of text preprocessing choices, including feature dimensionality, tokenization, stop-word removal, stemming, and different term weighting schemes. Three well-known machine learning algorithms serve as benchmark classifiers: Naïve Bayes, a neural network, and a support vector machine. Here we show that text preprocessing strategies are important determinants of the classifiers' performance. We find that the classifiers perform better on high-dimensional datasets represented by bigrams or trigrams selected according to a non-binary weighting scheme. Stemming and stop-word removal seem to be less important.
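
The preprocessing choices named in the abstract (tokenization, stop-word removal, n-gram range, binary versus non-binary term weighting) can be illustrated with a minimal standard-library sketch. This is not the authors' pipeline: the function names, the tiny stop-word list, the sample reviews, and the smoothed TF-IDF formula are all assumptions chosen for illustration, and stemming is left out because it would need an external stemmer.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the study).
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "in", "it"}


def tokenize(text, remove_stop_words=False):
    """Lowercase the text and split it into alphabetic word tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def ngrams(tokens, n_max=1):
    """Return all word n-grams of length 1..n_max, joined by spaces."""
    grams = []
    for n in range(1, n_max + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


def vectorize(docs, n_max=1, weighting="tfidf", remove_stop_words=False):
    """Build one sparse {term: weight} dict per document.

    weighting="binary" marks term presence with 1.0;
    weighting="tfidf" uses tf * (ln((1 + N) / (1 + df)) + 1), a smoothed
    variant picked here as one example of a non-binary scheme.
    """
    counts = [Counter(ngrams(tokenize(d, remove_stop_words), n_max)) for d in docs]
    if weighting == "binary":
        return [{term: 1.0 for term in c} for c in counts]
    df = Counter()
    for c in counts:
        df.update(c.keys())  # each document counts once per term
    n_docs = len(docs)
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def vocabulary(vectors):
    """Collect the set of distinct terms (the feature dimensionality)."""
    return set().union(*vectors)


if __name__ == "__main__":
    # Hypothetical reviews, made up for this sketch.
    reviews = [
        "The room was great and the staff was friendly",
        "Great hotel, great location, great price",
    ]
    for n_max in (1, 2, 3):
        vectors = vectorize(reviews, n_max=n_max, weighting="tfidf")
        print(n_max, len(vocabulary(vectors)))
```

Raising `n_max` adds bigrams and trigrams to the vocabulary and so increases feature dimensionality, which is the kind of setting the abstract reports as helping the classifiers. The resulting sparse vectors would then be passed to whichever classifier is being benchmarked.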

Keywords

fake, reviews, text preprocessing, bag of words, machine learning
