Sentiment analysis for Finnish Internet comments - and a point about lemmatization

I've recently been learning me some basics about both data science and machine learning. I decided to up the challenge a bit and tried out implementing sentiment analysis for a corpus of Finnish social media comments.

So I started by looking for a suitable data set. I found a set of 27,000 comments collected by researchers at Turku University. This data set is available via Kielipankki. Search for "FinnSentiment".

I used various online resources to learn more about natural language processing and sentiment analysis. An approachable sentiment analysis tutorial at Towards Data Science, with Keras examples, was extremely helpful.

Setting up & generating predictions

First I removed all special characters and URLs from the data and changed it all to lower case.

import re

url_pattern = re.compile(r'https?://\S+|www\.\S+')
# Strip URLs first, then drop everything except letters, digits and spaces
data["text"] = data["text"].apply(lambda text: url_pattern.sub('', text))
data["text"] = data["text"].apply(lambda text: re.sub(r"[^a-zA-Z0-9ÄÖÅäöå ]", "", text.lower()).strip())

I tokenized and vectorized the data:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Keep the max_words most frequent tokens and map them to integer indices
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(data["text"])
sequences = tokenizer.texts_to_sequences(data["text"])
# Pad/truncate every comment to max_len tokens
sequences = pad_sequences(sequences, maxlen=max_len)

And yes, I'm using Pandas dataframes a lot, hence referring to the data with data["text"].
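For reference, preparing the labels and the train/test split can look roughly like this (a sketch only; the column name "label", the 0 = negative / 1 = neutral / 2 = positive encoding and the 80/20 split are assumptions for illustration, not necessarily exactly what I used):

from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

# Assumption: the sentiment column is named "label" and holds 0 = negative,
# 1 = neutral, 2 = positive; one-hot encode it for the softmax output layer
labels = to_categorical(data["label"], num_classes=3)

# Assumption: a plain 80/20 split of the padded sequences
X_train, X_test, y_train, y_test = train_test_split(
    sequences, labels, test_size=0.2, random_state=42)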

I started with the model from the aforementioned tutorial and eventually wound up with a slightly modified version:

from tensorflow.keras import Sequential, layers, optimizers
from tensorflow.keras.callbacks import ModelCheckpoint

model = Sequential()
model.add(layers.Embedding(max_words, 40, input_length=max_len))
model.add(layers.Bidirectional(layers.LSTM(20, dropout=0.4)))
model.add(layers.Dense(3, activation='softmax'))  # three classes: negative, neutral, positive
opt = optimizers.RMSprop(learning_rate=0.001, momentum=0.0)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
# Keep the weights from the epoch with the best validation accuracy
checkpoint = ModelCheckpoint("best.hdf5", monitor='val_accuracy', verbose=1,
                             save_best_only=True, mode='auto', period=1,
                             save_weights_only=False)
history = model.fit(X_train, y_train, epochs=25, validation_data=(X_test, y_test),
                    callbacks=[checkpoint])

After 5 epochs, accuracy was 75% but the confusion matrix looked like this:

So clearly there's a bit of a balance issue. Unfortunately the data contains far more "neutral" rows than "positive" or "negative" rows.

I calculated custom class weights according to how well each label is represented in the data, and applied them:

# Weight each class inversely to how common it is in the data
class_weight = {0: total / (3 * negatives),
                1: total / (3 * neutrals),
                2: total / (3 * positives)}
[..]
model2.fit(X_train, y_train, epochs=5, validation_data=(X_test, y_test),
           callbacks=[checkpoint], class_weight=class_weight)

And now I ended up with this confusion matrix:

It's a somewhat decent outcome. Clearly the model has learned something, even though it's still far from perfect.

Encouragingly, when we look at negative samples that were predicted as positive and positive samples that were predicted as negative, the results are quite good: only 8% of negative samples were predicted positive, and only 11% of positive samples were predicted negative.
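Those numbers can be read straight off the confusion matrix, for example like this (a sketch; it assumes the same label order as in the class_weight dictionary above, i.e. 0 = negative, 1 = neutral, 2 = positive):

import numpy as np
from sklearn.metrics import confusion_matrix

# Predicted class = index of the largest softmax output
y_pred = np.argmax(model2.predict(X_test), axis=1)
y_true = np.argmax(y_test, axis=1)

cm = confusion_matrix(y_true, y_pred)  # rows = true labels, columns = predictions

# Share of truly negative comments predicted positive, and vice versa
neg_as_pos = cm[0, 2] / cm[0].sum()
pos_as_neg = cm[2, 0] / cm[2].sum()
print(f"negative -> positive: {neg_as_pos:.1%}, positive -> negative: {pos_as_neg:.1%}")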

Next I tried stemming and lemmatization. I used Snowball for stemming and Voikko for lemmatization. Voikko requires a bit more setup, so be sure to follow the installation instructions on its website if you want to use it.

I also tried removing a long list of stopwords from the data, but that didn't work out all that well - I think those lists remove a bit too many words, and many of my entries were left with very little actual content. Some of those words are also clearly biased towards certain kinds of emotional contexts. So I opted for my own very short stopword list, composed of a few of the most common words, such as "olla" (to be), "ja" (and), "minä" (I), etc.
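As a rough sketch of this preprocessing step - lemmatize with Voikko when it recognizes a word, fall back to the Snowball stemmer when it doesn't, and drop the short stopword list - something like this works. The helper names and the exact stopword list are illustrative, not my exact code:

import libvoikko
from nltk.stem.snowball import SnowballStemmer

voikko = libvoikko.Voikko("fi")        # needs the Voikko library and a Finnish dictionary installed
stemmer = SnowballStemmer("finnish")
stopwords = {"olla", "ja", "minä"}     # a deliberately short list of very common words

def normalize(word):
    # Use Voikko's base form if the word is recognized, otherwise stem it
    analyses = voikko.analyze(word)
    if analyses:
        return analyses[0]["BASEFORM"].lower()
    return stemmer.stem(word)

def preprocess(text):
    tokens = [normalize(word) for word in text.split()]
    return " ".join(token for token in tokens if token not in stopwords)

data["text"] = data["text"].apply(preprocess)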

With lemmatization, plus stemming for the words that Voikko doesn't recognize, I ended up with this confusion matrix:

Maybe a small overall improvement, but it's hard to say, since the negatives lost some accuracy.

With no stemming and only lemmatization, I wound up with:

And with only stemming, the result was:

Interestingly, it made hardly any difference whether I used lemmatization or stemming!

I'm not really sure how good the stemming and lemmatization libraries for Finnish are, given that Finnish is a fairly small language. It's quite possible that the libraries just aren't producing good enough results yet. On the other hand, it's also possible that the libraries work fine but there just isn't enough good data available for my models (or my models might suck, which is a distinct possibility as well).

Regarding that...

The researchers clearly put a lot of effort into collecting the data. But I do feel we would still need a bit more of it to produce great sentiment analysis.

There are also some challenges with Finnish compared to English. Finnish obviously has very complex inflection rules: the word "koira" (dog) can also be "koiramme" (our dogs), "koiranne" (your dogs), "koiratko?" (the dogs? even the dogs?), "koiramaista" (dog-like, typically as in dog-like behavior), "koirinensa" (with their dog), etc.

Finnish also has a somewhat longer average word length than English: about 7.5 characters versus 5.

And finally, Finnish has a lot of compound words. English has quite a few too, but overall Finnish comes out ahead.

Now back to the data itself. It seems that comments expressing aggression and dislike, comments expressing a possibly negative emotion, and comments containing certain negative keywords were all labelled as negative: the comment "Surullista" (sad), the comment "Äänekäs sovinistiörkki! ;)" (loud chauvinist pig! ;), probably said jokingly) and "Meilläkää ei hirveesti nekruja näy,, mutta ei niitä kaivatakkaa." (a racist comment with the n-word) were all labelled negative. If I wanted to detect hate speech, only one of those would be labelled as hate speech. Meanwhile, I'm fairly sure you can't consider a single word expressing a single common emotion as negative - not without more context, anyway.

There are also some comments where I just don't agree with the labeling. For example, the neutral question "Missä kohtaa olen sinua nimitellyt?" (Where did I name-call you?) is labeled as negative by all three reviewers. Then there are some weird positives like "naurettavaa :D" (ridiculous, and typically not in a good way).

I don't want to be too critical of the data, though; I think the researchers did a great job overall, and adding to the pool of Finnish text data sets is extremely important. I merely think that with more data - and perhaps an ever so slight improvement in the quality of the existing data - we could end up with better results.

Regarding the results, here's the full paper by the researchers.

I think I ended up beating their CNN model, though just in case I am misinterpreting the results in that paper, I'll not open any champagne bottles quite yet.

One more thing - from multi-class to binary

I had the additional idea of trying to classify the results on a binary scale.

Let's say, for example, that I wanted a model that can, at minimum, let half of the comments through immediately while putting the other half into a moderation queue for a human moderator to look at.

I changed my last layer from 3 neurons and softmax activation to 1 neuron and sigmoid activation, reformatted the data accordingly, and started playing around with it.
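Concretely, the change amounts to something like this (a sketch reusing the same layers and hyperparameters as above; the names model_bin, y_train_bin and y_test_bin are illustrative, with the latter two assumed to be 0/1 labels derived from the reviewer annotations):

model_bin = Sequential()
model_bin.add(layers.Embedding(max_words, 40, input_length=max_len))
model_bin.add(layers.Bidirectional(layers.LSTM(20, dropout=0.4)))
# One sigmoid unit instead of three softmax units
model_bin.add(layers.Dense(1, activation='sigmoid'))
model_bin.compile(optimizer=optimizers.RMSprop(learning_rate=0.001),
                  loss='binary_crossentropy', metrics=['accuracy'])

model_bin.fit(X_train, y_train_bin, epochs=5,
              validation_data=(X_test, y_test_bin))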

The results weren't that great; one of the best was this:

So here "negative" are those where all 3 reviewers agreed that the comment is negative.

In this system, 19 negative comments out of 105 would have slipped past the moderation queue. That's nowhere near good enough, unfortunately.

Here's one where two reviewers agreeing was enough to label a comment as negative:

Not very good either.

But then, this data wasn't exactly labeled with this use in mind, so it's not really a big surprise that it didn't work all that well.

I split the predictions in half by taking the median of the predicted values and assuming that anything above the median was positive and anything below it was negative.
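In code, the split is just a couple of lines (a sketch; preds stands for the sigmoid outputs on the comments being screened):

import numpy as np

preds = model_bin.predict(X_test).ravel()   # sigmoid outputs between 0 and 1

# Everything above the median is let through, the rest goes to the queue
threshold = np.median(preds)
let_through = preds > threshold
to_moderation = ~let_through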

In a business case, this would mean that half of the comments go to moderation and half are immediately let in without moderation.

I do think that this kind of AI-enhanced workflow is much better than relying solely on AI. AIs still make a lot of mistakes, and they also carry gender, racial, and other biases. In an ideal world, maybe something like 2% of comments would be automatically blocked by the AI, 30% would go to a human moderation queue, and the rest would go through automatically. This way the human moderators' workload would be significantly reduced, but very few if any comments would be unfairly blocked by the AI.

Final thoughts

I didn't really succeed with this project in the sense of producing something good enough for a business use case.

I did, however, manage to produce a model that somewhat worked, and I do think that with more time spent tuning the parameters and cleaning the data, better results could be achieved. I also learned a lot along the way.

But for truly good results, I think more data is needed, more specific business use cases need to be defined, and the data needs to be labeled according to those specifications.

If any more experienced machine learning engineers have insights on this, feel free to comment below - I would appreciate it! 🙏

Thanks and cyas.