I always wonder what kind of algorithm is at work analyzing #Twitter data, taking vast amounts of personal information and deducing all those bogus things about me. For the most part, this is very opaque, but at least with the deduced languages it's slightly more obvious. 1/6
So right now Twitter shows English, Indonesian, German, Russian for me. English and German are obvious and have always been on this list. The other two languages are new, however: in March it was Dutch instead. Russian was apparently added because I have tweeted in Russian lately. 2/6
Dutch is also easy to explain: language recognition algorithms don't understand the text, they merely look for patterns typical of a particular language. A single typo can make English or German look like Dutch, the languages being fairly similar. 4/6
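To illustrate what "looking for patterns" can mean, here is a minimal sketch of one common approach, character trigram matching. This is my own toy example, not whatever Twitter actually runs: the training sentences, `trigrams` and `guess_language` are all made up for illustration, and a real detector would learn its profiles from far more text.

```python
from collections import Counter

# Tiny, made-up training samples (an assumption for illustration only).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then the cat sat with the other animals",
    "german": "der schnelle braune fuchs springt über den faulen hund und die katze sitzt bei den anderen tieren",
    "dutch": "de snelle bruine vos springt over de luie hond en de kat zit bij de andere dieren",
}


def trigrams(text):
    """Count overlapping three-character sequences, padded with spaces."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


# One trigram frequency profile per language.
PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}


def guess_language(text):
    """Return the language whose profile shares the most trigram weight with text."""
    grams = trigrams(text)

    def score(profile):
        total = sum(profile.values())
        # Counter returns 0 for trigrams the profile has never seen.
        return sum(count * profile[g] / total for g, count in grams.items())

    return max(PROFILES, key=lambda lang: score(PROFILES[lang]))


print(guess_language("the dog and the cat"))
print(guess_language("der hund und die katze"))
```

Nothing in here knows any words or grammar, it only compares letter sequences. Dutch shares many of those sequences with both English and German, which is why one odd spelling can tip the balance.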
But where did it get Indonesian from? While Indonesian uses a Latin script, the language structure is very different from the languages I write. And I rarely use short phrases, so there is always enough text for the algorithm to work on. Beats me… 5/6
I can only guess that the algorithm in use produces lots of mistakes. Normally, the mistakes don't matter: it still gets most tweets right, so the correct languages are deduced. But occasionally it misidentifies enough tweets as the same wrong language to cross the threshold. /end
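A sketch of how I imagine that threshold working. This is pure guesswork on my part: `deduced_languages`, the `min_tweets` value and the tweet counts are all hypothetical, and I have no idea what rule Twitter actually applies.

```python
from collections import Counter


def deduced_languages(detected, min_tweets=3):
    """Keep every language detected in at least min_tweets tweets (hypothetical rule)."""
    counts = Counter(detected)
    return sorted(lang for lang, n in counts.items() if n >= min_tweets)


# Made-up per-tweet detection results: mostly right, with a few stray errors.
typical = ["en"] * 40 + ["de"] * 25 + ["nl", "id"]
# Same account, but two more tweets happen to be misread as Indonesian.
unlucky = typical + ["id", "id"]

print(deduced_languages(typical))  # ['de', 'en']
print(deduced_languages(unlucky))  # ['de', 'en', 'id']
```

With a rule like this, isolated errors stay invisible, and a bogus language only shows up once the same mistake has been repeated often enough.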