You are here

How language is hiding the real internet from you

When you go online, it feels like you're accessing all the world's information. But you form social media relationships based on shared language. You search Google with the language you think in. And algorithms built to maximise attention have no reason to recommend what you won't understand. So, most of the internet remains out of sight, on the other side of a language filter – and you're missing far more than content.

Most internet activity is concentrated on a small number of large platforms, and from our linguistically siloed perspectives, it's easy to assume that everyone uses them in similar ways. But why should that be true? We expect music, literature and cuisine to vary between cultures, after all, so why not the internet? 

In a new paper, our team at the University of Massachusetts Amherst's Initiative for Digital Public Infrastructure has uncovered stark differences in how different cultures harness the internet. With more research, these findings may reshape how we think about the services that dominate the web. We're only just beginning to understand the implications.

The history of the internet offers some examples. Take the Russian social media/blogging platform LiveJournal. When it was popular in the mid-2000s, English-speaking users knew it as a space for young people to share their feelings or geek out about Harry Potter. But if you're a Russian speaker, you probably know LiveJournal very differently – as an important site of public intellectualism and political discourse, playing a rare role in hosting voices from the opposition.

With the biggest technology companies based in the US, a cultural blind spot has emerged where we often assume that the English internet is representative of the rest of the world. Research about YouTube in particular has a significant English-speaking bias – typically written in English, published in English-speaking countries and focused on English-language videos.

The internet's leading platforms are more difficult to study than you might think. Computers can blaze through text, but video is harder to parse at scale. Platforms like YouTube, the world's most popular video service, don't offer tools to create the large representative samples necessary to understand the platform as a whole, or big swaths of it like linguistic communities.

As a result, YouTube is often understood through the easily accessible tip of the iceberg: its most popular videos. Between the language bias and this popularity bias, when users, creators, academics, educators, parents and even policymakers talk about platforms like YouTube, we're typically just talking about the part that's most visible to us – a small, unrepresentative piece of it. (For more, read Thomas Germain's story on the hidden world beneath the shadows of YouTube's algorithm.)

So, how do you study what's under the surface? A couple of years ago, we came up with a way to do what YouTube's tools couldn't: we randomly guessed the URLs of videos – more than 18 trillion times – until we had enough videos to paint a picture of what's really happening on YouTube.
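To make the idea of sampling by guessing concrete, here is a minimal toy sketch in Python. It is not our research pipeline: the ID space, the hidden set of valid IDs and the function names are all invented for illustration, and the real ID space is vastly larger. The point is only that uniformly random guesses yield an unbiased sample, and that the hit rate lets you estimate how many items exist in total.

```python
import random


def sample_by_guessing(id_space_size, is_valid, n_guesses, rng):
    """Guess IDs uniformly at random and keep the ones that exist.

    Every valid ID is equally likely to be hit, so the hits form a
    random sample that does not favour popular items.
    """
    hits = set()
    n_hits = 0
    for _ in range(n_guesses):
        candidate = rng.randrange(id_space_size)
        if is_valid(candidate):
            n_hits += 1
            hits.add(candidate)
    return hits, n_hits


def estimate_population(n_hits, n_guesses, id_space_size):
    """Scale the observed hit rate up to the whole ID space."""
    return n_hits / n_guesses * id_space_size


# Toy setup (all numbers are hypothetical): a space of one million
# possible IDs, of which 20,000 secretly correspond to real items.
rng = random.Random(0)
ID_SPACE = 1_000_000
TRUE_COUNT = 20_000
N_GUESSES = 200_000
valid_ids = set(rng.sample(range(ID_SPACE), TRUE_COUNT))

sample, n_hits = sample_by_guessing(
    ID_SPACE, valid_ids.__contains__, N_GUESSES, rng
)
estimate = estimate_population(n_hits, N_GUESSES, ID_SPACE)
print(f"sampled {len(sample)} items; estimated total: {estimate:,.0f}")
```

In this toy version, a guess is a cheap set lookup. In practice each guess means checking whether a URL resolves to a video, which is why so many guesses are needed when valid IDs are rare.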

What we put together was a first-time look at the inner workings of one of the most influential websites on earth. With a large enough representative sample, we could begin making broader comparisons. How do videos uploaded in 2019 compare to videos uploaded in 2021? Do videos of animals get more comments than videos of sports? What kinds of things can we see when we compare popular videos to those with just a handful of views?

Most of all, we wanted to explore linguistic differences: how language and culture shape online participation at a global scale.

So, in 2024 we examined language-specific samples of English, Hindi, Russian and Spanish YouTube, working with native speakers to validate our language detection tools. Our goal was to take a high-level view of YouTube in each language to look for broad patterns. We had to acknowledge that YouTube might be just as simple as many people assume: more or less the same across languages. But that's not what we found.

Each language varies in multiple dimensions, but one corner of the platform stood out. In short, Hindi YouTube is radically different from its counterparts. It seems like Hindi users are relating to each other with rhythms and dynamics we didn't see in any other language, and buried in the numbers, we could see the story of a major geopolitical conflict.

Let's start with growth. The chart below shows what share of each language's videos was uploaded each year from 2014 to 2023. All four are growing rapidly, but more than half of all Hindi YouTube videos were uploaded in 2023 alone. Then there's length. Spanish videos are a little longer than the rest, with a median of about two-and-a-half minutes. English isn't far behind at nearly two minutes, followed by Russian at one minute 38 seconds. But the median Hindi YouTube video is just 29 seconds long.

These details might sound like interesting quirks – but they're actually a reflection of India's internet history. TikTok was incredibly popular in India, long before the app exploded in the US and Europe, but that all changed after India banned the app amid border clashes with China in 2020. Overnight, hundreds of millions of users were cut off from their videos, comments, businesses and self-expression.

YouTube rushed in to fill the void, making India the first market for YouTube Shorts, a feature the company built to highlight the short-form vertical video format that made TikTok famous. It looks to have been successful. More than half of Hindi YouTube – 58% – is made up of Shorts, compared to just 25-31% for the other languages. In many countries, Shorts is just a TikTok clone, but it's become a much larger ecosystem in India. 

The influence of TikTok and Shorts shows up in other ways, too. The next chart focuses on videos of 30 seconds or less, showing what portion of each language's videos are one second long, two seconds long, and so on. There is a spike across all languages (though it is particularly extreme in Hindi) at 15 seconds, a default length for TikTok that was later adopted as a default for Shorts.

Terms like "median duration by language" may seem dry, but here, they hint at a sea change in the way people use video in many parts of the world. Next, we found a telling difference in how people described their own videos. YouTube asks people to categorise their videos. Most users don't bother to change the default, People & Blogs. But when we excluded that, the differences between languages grew sharper.

You can see this in the last chart below. In Russian, gaming videos dominate. Gaming is the most popular category in English and Spanish, too. But in Hindi, Entertainment and Education are on top. And for all the attention English-language political content gets in the popular discourse, English has the smallest number of videos in the "News & Politics" category.

These category labels are more than metadata. They're a look at how different cultures use the platform for different purposes. What we're seeing is parallel internets shaped by local needs, expectations and norms. The data suggests that people in different linguistic communities aren't just making different videos and engaging with them differently; they may be using YouTube for completely different reasons.

Finally, we looked at popularity metrics – views, likes and comments – and once again, Hindi YouTube was an outlier. It demonstrated extreme inequality. Just 0.1% of Hindi videos accounted for 79% of views (in the other languages, the figure ranged from 54% to 59%). But there's an interesting twist. The less popular Hindi videos were far more likely to have likes.
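For readers who want to see how figures like these are computed, here is a small Python sketch. It is an illustration under stated assumptions, not the code behind our paper: the function names, the toy data and the 10-view cutoff for "less popular" are all hypothetical.

```python
def top_share_of_views(view_counts, top_fraction):
    """Fraction of all views captured by the most-viewed videos.

    top_fraction=0.001 means "the top 0.1% of videos". At least one
    video is always counted, even in a very small sample.
    """
    if not view_counts:
        raise ValueError("need at least one video")
    ranked = sorted(view_counts, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


def liked_rate_among_low_view(videos, max_views):
    """Share of low-view videos that have at least one like.

    `videos` is a list of (views, likes) pairs; `max_views` is an
    arbitrary cutoff for what counts as "less popular".
    """
    low = [likes for views, likes in videos if views <= max_views]
    if not low:
        return 0.0
    return sum(1 for likes in low if likes > 0) / len(low)


# Toy data (invented): one hit video and 999 barely-watched ones,
# 400 of which still picked up a like.
toy_videos = [(1000, 50)] + [(1, 1)] * 400 + [(1, 0)] * 599

share = top_share_of_views([views for views, _ in toy_videos], 0.001)
low_view_liked = liked_rate_among_low_view(toy_videos, max_views=10)
print(f"top 0.1% share of views: {share:.1%}")
print(f"low-view videos with a like: {low_view_liked:.1%}")
```

In this toy sample, a single video holds about half of all views, while roughly 40% of the barely-watched videos still have a like. Comparing those two numbers across languages is the kind of contrast described above.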

That suggests something deeper. On Hindi YouTube, even the videos that aren't being seen are being appreciated and acknowledged. Our new research suggests YouTube in India may often be used like a video messaging service to talk to friends and family, with public videos often intended for a private audience. 

We think some of these differences can be explained by how the internet has been adopted in India, and the country's TikTok inheritance. This may be a different kind of attention economy, less about mass reach, more about small, meaningful engagement. It may be a sign of something more intimate, and perhaps even more human.

We still have a lot of work to do, and a lot of videos to watch, before we can make these claims definitively. But what's already clear is that language doesn't just shape your view of digital life – it can obscure the diverse, culturally specific ways people use these platforms. We're building businesses, journalism and regulation on an artificially limited view of the internet, one often filtered through English, popularity and convenience.

It's time we looked deeper.

Ryan McGrady