I always wonder what kind of algorithm is at work analyzing data, taking great amounts of personal information and deducing all those bogus things about me. For the most part, this is very opaque, but at least with deduced languages it's slightly more obvious. 1/6

So right now Twitter shows English, Indonesian, German, Russian for me. English and German are obvious and have always been on this list. The other two languages are new, however; in March it was Dutch instead. Russian was apparently added because I tweeted in Russian lately. 2/6

The fact that neither Russian nor Norwegian was on that list before means: it doesn't matter what I read. This makes sense: just because I saw a foreign-language tweet doesn't mean that I understood it, or that I read it without machine translation. It's only about the languages I write in. 3/6

Dutch is also easy to explain: language recognition algorithms don't understand the text, they merely look for patterns typical of a particular language. A single typo can make English or German look like Dutch, the languages being fairly similar. 4/6
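
To make the "patterns, not understanding" point concrete, here is a minimal sketch of one common approach: comparing character trigram counts against per-language profiles. This is an assumption for illustration only; which technique Twitter actually uses is unknown. The sample sentences, `trigrams`, `cosine` and `guess_language` are all hypothetical names and data, and a real detector would be trained on far more text.

```python
from collections import Counter
from math import sqrt

# Tiny hypothetical training samples (illustration only, not real training data).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then it went to the water",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann ging er zum wasser",
    "nl": "de snelle bruine vos springt over de luie hond en toen ging hij naar het water",
}


def trigrams(text):
    """Count overlapping three-character sequences, padded so word edges count too."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# One trigram profile per language, built from the samples above.
PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}


def guess_language(text):
    """Return the language whose profile is most similar to the text."""
    grams = trigrams(text)
    return max(PROFILES, key=lambda lang: cosine(grams, PROFILES[lang]))
```

Nothing here knows what a word means: the guess depends only on which letter sequences occur, so changing a few characters shifts the scores, and closely related languages share many trigrams.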

But where did it get Indonesian from? While Indonesian uses a Latin script, the language structure is very different from the languages I write in. And I rarely use short phrases, so there is always enough text for the algorithm to work on. Beats me… 5/6

I can only guess that the algorithm in use produces lots of mistakes. Normally, the mistakes don't matter: it still gets most tweets right, so the correct languages are deduced. But occasionally it misidentifies enough tweets to cross the threshold. /end
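
A rough sketch of that guess, under loud assumptions: each tweet gets one language label, and a language is listed once its share of labeled tweets reaches some cutoff. The function `deduced_languages`, the `min_share` parameter and its value of 5% are all hypothetical; the real rule and threshold are not known.

```python
from collections import Counter


def deduced_languages(tweet_labels, min_share=0.05):
    """Return languages whose share of per-tweet labels reaches min_share.

    min_share is a made-up cutoff for illustration, not a known value.
    """
    total = len(tweet_labels)
    if total == 0:
        return []
    counts = Counter(tweet_labels)
    return sorted(lang for lang, n in counts.items() if n / total >= min_share)


# Two stray "id" labels out of 100 stay below the cutoff...
few_errors = ["en"] * 60 + ["de"] * 38 + ["id"] * 2
# ...but six misidentified tweets are enough to cross it.
more_errors = ["en"] * 60 + ["de"] * 34 + ["id"] * 6
```

With these inputs, `deduced_languages(few_errors)` leaves Indonesian out, while `deduced_languages(more_errors)` includes it, even though the author never wrote a single Indonesian tweet in either case.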
