Abstrakt:
Fake review detection is getting crucial due to rapid growth of internet purchases. Obviously, it is important to choose the most efficient algorithm in order to detect fake (deceptive, spam) reviews either positive or negative. On the other hand, it is also important to pre-process the textual content of the reviews for training and later for production environment. A number of text preprocessing methods are examined in this study, such as feature dimensionality, tokenization, removal of stop words, stemming and different term weighting schemes. Three well-known machine learning algorithms are used as benchmark classifiers, including Naïve Bayes, neural network and support vector machine. Here we show that text preprocessing strategies are important determinants of the classifiers' performance. We find that the classifiers perform better for high-dimensional datasets represented by bigrams or trigrams selected according to the non-binary weighting scheme. Stemming and stopword removal seem to be less important.