Researchers from the Institute for Strategic Dialogue (ISD) analysed the responses of four popular chatbots (ChatGPT, Gemini, Grok and DeepSeek) to a range of questions in English, Spanish, French, German and Italian on topics related to the Russian invasion of Ukraine.
Almost one-fifth of responses cited Russian state-attributed sources, many of them sanctioned in the EU. Questions biased in favour of Russia were more likely to surface these sources, as were queries about Ukraine's conscription of civilians and perceptions of NATO. Some chatbots struggled to identify state-affiliated content, especially when it had been disseminated by third-party outlets or websites.
With close to 45 million users in the EU, ChatGPT is approaching the threshold at which the European Commission would designate it a Very Large Online Search Engine (VLOSE) under the Digital Services Act, subjecting it to a higher level of regulatory scrutiny, the researchers write.
Key points of the research:
- ISD tested 300 queries in five languages; Russian state-attributed content appeared in 18 percent of responses. This included citations of Russian state media, sites tied to Russian intelligence agencies, and sites known to be involved in Russian information operations, identified in prior research into chatbot responses.
- Almost a quarter of malicious queries designed to elicit pro-Russian views surfaced Kremlin-attributed sources, compared with just over 10 percent of neutral queries. This suggests LLMs can be manipulated into reinforcing pro-Russia viewpoints rather than promoting verified information from legitimate sources.
- Among all chatbots, ChatGPT cited the most Russian sources and was most influenced by biased queries. Grok, meanwhile, often linked to Russian-aligned but non–state-affiliated accounts amplifying pro-Kremlin narratives. Individual DeepSeek responses sometimes produced large volumes of state-attributed content, while Google-owned Gemini frequently displayed safety warnings for similar prompts.
- Some topics surfaced more Russian state-attributed sources than others. For instance, questions about peace talks resulted in twice as many citations of state-attributed sources as questions about Ukrainian refugees. This suggests that LLM safeguards may vary in effectiveness depending on the specific topic.
- The language used in queries had limited impact on the likelihood of LLMs citing Russian state-attributed sources. While each model responded differently, the sources surfaced to users were roughly similar across the five languages tested. Spanish and Italian queries surfaced Russian sources (mostly in English) in 12 of 60 results each, compared with 9 of 60 for German and French, the languages with the lowest rates.
Not a surprise to me.
ChatGPT is so sycophantic it makes me ill to use. Its primary objective seems to be pleasing the user, which helps it 'score' well in many tests.
Gemini used to not be so bad, but Google deep-fried 2.5 Pro/Flash, and like ChatGPT, its use of search results is questionable.
Grok is just drunk on kool-aid at Musk's behest. Honestly, it hurts the model everywhere, and I feel bad for the engineers trying to make something workable without triggering their boss.
Deepseek is 100% infected by CCP propaganda, but the base model isn't that bad. The public chat portal does a lot of the censoring and search biasing, but the actual weights (and API) aren't nearly as censored: you can feel that the engineers didn't really want to censor the model. And since it's open-weights, you can use less 'biased' finetunes (like the ones Perplexity and Microsoft have both made) or get around it in a number of ways.
I'd recommend GLM 4.6 to folks concerned with this instead. It's Chinese, yes, but its 'deep research' web search is extremely good, and the model is quite sane and level-headed. Just use it with its CCP censorship in mind. Even better, it's free, dirt cheap if you need thousands of requests, and quite self-hostable on a desktop Ryzen/Intel box (whereas Deepseek requires an EPYC/Threadripper/Xeon homelab at a bare minimum).
Mistral (from France) is alright, but its general intelligence is kinda behind the others at the moment.