Archive link: https://archive.is/20240503184140/https://www.science.org/content/article/human-speech-may-have-universal-transmission-rate-39-bits-second

Interesting excerpt:

> De Boer agrees that our brains are the bottleneck. But, he says, instead of being limited by how quickly we can process information by listening, we're likely limited by how quickly we can gather our thoughts. That's because, he says, the average person can listen to audio recordings sped up to about 120%—and still have no problems with comprehension. "It really seems that the bottleneck is in putting the ideas together."

Ah, here's a link to the paper!

[–] lvxferre@mander.xyz 4 points 1 week ago* (last edited 1 week ago)

I linked the paper in the OP. Check page 7 - it shows the formulae they're using.

I'll illustrate the simpler one. Let's say your language allows five syllables, with the following probabilities:

  • σ₁ - appears 40% of the time, so p(σ₁) = 0.4
  • σ₂ - appears 30% of the time, so p(σ₂) = 0.3
  • σ₃ - appears 20% of the time, so p(σ₃) = 0.2
  • σ₄ - appears 8% of the time, so p(σ₄) = 0.08
  • σ₅ - appears 2% of the time, so p(σ₅) = 0.02

If you apply the first formula, here's what you get:

  • E = -∑ [p(x)*log₂(p(x))]
  • E = - { [0.4*log₂(0.4)] + [0.3*log₂(0.3)] + [0.2*log₂(0.2)] + [0.08*log₂(0.08)] + [0.02*log₂(0.02)] }
  • E ≈ 1.92 bits
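
If you want to check the arithmetic yourself, here's a quick Python sketch (mine, not the paper's code), using the toy distribution above:

```python
import math

# Toy syllable distribution from the example above: p(σ₁)..p(σ₅)
probs = [0.4, 0.3, 0.2, 0.08, 0.02]

# Shannon entropy: E = -∑ p(x) * log₂(p(x))
entropy = -sum(p * math.log2(p) for p in probs)

print(f"E ≈ {entropy:.2f} bits per syllable")  # E ≈ 1.92 bits
```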

Of course, natural languages allow way more than just five syllables, so the actual number will be way higher than that. Also, since some syllables are more likely to appear after other syllables, you need the second formula - for example, if your first syllable is "sand", the second one might be "wich" or "ing", but odds are it won't be "dog" (a sanddog? Messiest of the puppies. Still a good boy.)
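
The second formula is basically the same thing conditioned on the previous syllable. Here's a rough sketch of how you could estimate it from bigram counts (toy data I made up, not the paper's corpora - real estimates need a big corpus):

```python
import math
from collections import Counter

# Toy syllable stream; the paper estimates this from large corpora.
stream = ["sand", "wich", "sand", "ing", "sand", "wich", "dog"]

bigrams = Counter(zip(stream, stream[1:]))               # counts of (x, y) pairs
firsts = Counter(x for x, _ in zip(stream, stream[1:]))  # counts of x as a first element
total = sum(bigrams.values())

# Conditional entropy: E = -∑ p(x,y) * log₂(p(y|x))
cond_entropy = -sum(
    (n / total) * math.log2(n / firsts[x])
    for (x, _), n in bigrams.items()
)

print(f"E ≈ {cond_entropy:.2f} bits per syllable, given the previous one")
```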

> If I pick a random word such as ‘sandwich’ and encode it in ASCII, it takes 8 bytes, i.e. 64 bits. According to the scientists, a two-syllable word in English only holds 14 bits of actual information.

ASCII is extremely redundant - it uses 8 bits per letter, but if you're handling up to 32 graphemes, 5 bits is enough. And some letters don't even add information to the word: for example, if I show you the word "d*gh*us*", you can correctly guess it's "doghouse", even with the ⟨o⟩'s and the ⟨e⟩ missing.
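
To put numbers on that redundancy, here's a toy comparison (my own sketch, not from the paper) of ASCII against a bare fixed-width code for a 26-letter alphabet:

```python
import math

word = "sandwich"
alphabet_size = 26  # lowercase letters only

ascii_bits = len(word) * 8  # ASCII spends 8 bits per letter → 64 bits
fixed_bits = len(word) * math.ceil(math.log2(alphabet_size))  # 5 bits per letter → 40 bits

print(f"ASCII:      {ascii_bits} bits")
print(f"5-bit code: {fixed_bits} bits")
# And per the paper, the word carries only ~14 bits of actual information.
```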