this post was submitted on 14 Sep 2025
40 points (97.6% liked)

Programming

22709 readers
42 users here now

Welcome to the main community in programming.dev! Feel free to post anything relating to programming here!

Cross posting is strongly encouraged in the instance. If you feel your post or another person's post makes sense in another community cross post into it.

Hope you enjoy the instance!

Rules

Rules

  • Follow the programming.dev instance rules
  • Keep content related to programming in some way
  • If you're posting long videos try to add in some form of tldr for those who don't want to watch videos

Wormhole

Follow the wormhole through a path of communities !webdev@programming.dev



founded 2 years ago
MODERATORS
 

Soo I'm working on a database that needs to support multiple languages (two for now, but who knows). I stumbled across this blog post that explains how to develop what it calls a "translation subschema" (haven't seen it called like this anywhere else so I don't know if it's actually how you'd call it), which seems like a very nice way of dealing with things.

I'm not very experienced with DB stuff, so it took me a while to fully understand what it was doing, but now that (I think) I do, I'm wondering if I could just ignore the Languages table, and just use a language field in the tables TextContent and Translations, without loosing any functionality. (except of course having a table listing the available languages, which is not however something I'm interested in)

In my head everything would still work, I'd insert stuff with

INSERT INTO TextContent (OriginalText, OriginalLanguage)
VALUES ("Ciao", "it");

DECLARE TextContentId INT = SCOPE_IDENTITY();

INSERT INTO Translations (TextContentId, Language, Translation)
VALUES (@TextContentId, "it", "Ciao");
INSERT INTO Translations (TextContentId, Language, Translation)
VALUES (@TextContentId, "en", "Hello");

and given a TextContentId, i'd retrieve the correct translation with

SELECT Translation FROM Translations WHERE TextContentId = TCId AND Language = "en"

At this point, I'm thinking I could drop TextContent too, and just have a Translations table with TextContentId, Language, and Translation, with (TextContentId, Language) as primary key.

Am I missing something? I'm trying to simplify this solution but I don't want to risk making random errors.

Edit: translations on the DB are for user inserted text, which will also provide translations. The client then will only receive text it the correct language.

all 23 comments
sorted by: hot top controversial new old
[–] wetbeardhairs@lemmy.dbzer0.com 25 points 2 days ago (2 children)

You might want to consider using country codes in your language identifier.

en-US en-UK en-CA

all have slight variations in dialect. I don't speak other languages but I imagine it is important in many cases.

[–] eager_eagle@lemmy.world 16 points 2 days ago* (last edited 2 days ago) (1 children)

for reference, that's usually a ISO 639-1 combined with ISO 3166-1 alpha 2

and if an exact locale match is not available, it makes more sense to return another language match than the default language fallback

[–] czan@aussie.zone 2 points 2 days ago

I think IETF BCP 47 is the way to go for this (and is rougly a superset of what you're saying). They also define a "more/less specific" selationship between tags so you can ask for "en-CA" and have the system decide "en" is still appropriate.

[–] orsetto@lemmy.dbzer0.com 2 points 2 days ago

That's probably a good call, thanks :)

[–] TehPers@beehaw.org 15 points 2 days ago (2 children)

Localization is a hard problem, but storing your translations in the DB is a bit unusual unless you're trying to translate user data or something.

I'd recommend looking into tools like Project Fluent or similar that are designed around translating.

As for the schema you have, if you're sticking with it, I would change the language into an IETF language tag or similar instead. The important part is that it separates language variants. For example, US English and British (or international) English have differences, Brazilian Portuguese and Portugal Portuguese have differences, Mexican Spanish and Spain Spanish have differences, etc.

Using an ID instead of the text content itself as part of the PK should be a no-brainer. Languages evolve over time, and translations change. PKs should not. Your choice of PK = (TextContentId, Language) is the most reasonable to me, though I still think that translations should live as assets to your application instead to better integrate with existing localization tools.

One last thing: people tend to believe that translating is enough to localize. It is not. For example, RTL languages often swap the entire UI direction to RTL, not just the text direction. Also, different cultures sometimes use different colors and icons than each other.

[–] orsetto@lemmy.dbzer0.com 6 points 2 days ago (1 children)

Sorry, I didn't think to add in the post that the translations are in fact of user generated content, and are themselves provided by users.

Project Fluent is still a good resource tho, thank you.

And also yeah, I'll use a better schema for language tags, that's a clear fault

Using an ID instead of the text content itself as part of the PK should be a no-brainer. Languages evolve over time, and translations change. PKs should not.

~~I still don't get why having a separate table for languages is useful. I mean, even if the translation changes, the language itself will remain the same, right?~~

Oh, right. Taking into account language variants makes VERY obvious why I'd want to use a table to store them.

people tend to believe that translating is enough to localize. It is not.

Onestly, I just hope that won't be something i should have to worry about. The rest of the codebase is as shitty as it gets, and I don't want to be the one to refactor it for proper localization. I'm implementing a new feature that allows me some degree of movement to think about a good design for that, and new, features, but this is as far as I'll go (Yes I know I probably sound like an ass but it really is that bad)

[–] TehPers@beehaw.org 4 points 2 days ago (1 children)

I know I probably sound like an ass but it really is that bad

Nah I work in shitty codebases on a regular basis, and the less I need to touch them, the happier I am.

With regards to other localization changes, it's not important to localize everything perfectly, but it's good to be aware of what you can improve and what might cause some users to be less comfortable with the interface. That way you're informed and can properly justify a sacrifice (like "it'd cost us a lot of time to support RTL interfaces but only 0.1% of users would use them") rather than be surprised that there even is one being made.

Also, user-generated content explains why these are in a DB, and now it makes a lot more sense to me. User-generated translations used as-is makes more sense than trying to force Project Fluent (or other similar tools) into it.

[–] orsetto@lemmy.dbzer0.com 1 points 2 days ago

I mean for now it's not being requested to add other languages beside italian and english, and i'm pretty sure my employer will never care about languages he doesn't speak, so chances of languages that require some work other than translations are basically null.

[–] xianjam@programming.dev 1 points 2 days ago (1 children)

Why is storing resources in the database unusual? I've done that my entire career, and I've believed it to be a best practice.

[–] TehPers@beehaw.org 2 points 2 days ago

Storing UI assets in a database is unusual because assets aren't data, they are part of your UI. This is of course assuming a website - an application may choose to save assets in a local sqlite database or similar for convenience.

It's the same reason I wouldn't store static images in a database though - there's no reason to do so. Databases provide no additional value over just storing the images next to the code, and same with localizations.

User-generated content changes things because that data is now dynamically generated, not static assets for a frontend.

[–] Olap@lemmy.world 6 points 2 days ago

I'd suggest you get an i18n package for whatever language you are using and follow their conventions

[–] taaz@biglemmowski.win 4 points 2 days ago* (last edited 2 days ago)

Had to solve the same problem few months ago, user provided content and so, user provided translations.
We use postgres everywhere and we had to support 3 languages initially with one more eventually, so we decied to use json fields for anything that could be translated (which wasn't too much). Mind you, this was basically a (temporarily permanent) prototype project but (fresh) postgres has a good support and operators for json so it worked alright.

EDIT: I remembered that hstore might be a good alternative too, I think it was slightly less "heavy" and had better operators for the kind of access we needed

[–] HairyHarry@lemmy.world 4 points 2 days ago (1 children)

Better use a separate table for languages (int id, string language) and reference the language in your translation table by its ID.

That gives you a way to manage languages.

[–] orsetto@lemmy.dbzer0.com 2 points 2 days ago (2 children)

But why would I need that? (Onest question, I hope I don't sound rude)

I mean, I could easily retrieve the list of available languages, and it makes it faster to delete them using an ON DELETE CASCADE (right?), but it also complicates stuff a bit for general use

[–] bluGill@fedia.io 4 points 2 days ago (1 children)

So you can dump just that table and send to translators. you also know all missing translations with one query. If you add more languages you don't want to change all tables. All the normalization reasons apply too

[–] ChairmanMeow@programming.dev 2 points 2 days ago

None of those things necessarily require having a separate table for languages though.

[–] atzanteol@sh.itjust.works 3 points 2 days ago

That is how you end up with en, EN, English, etc.

[–] talkingpumpkin@lemmy.world 1 points 2 days ago (3 children)
INSERT INTO TextContent (OriginalText, OriginalLanguage)
VALUES ("Ciao", "it");

Shouldn't that be TextContent(TextContentId, OriginalText)? Something like

(then you should make the id a primary key, index originaltext and make the id in the other table a foreign key)

I could drop TextContent too, and just have a Translations table with TextContentId

Sure, but the you would have to reference the text via TextContentId in your code, which would be very annoying.

Instead you could have a function, say t("Ciao") that kinda runs something like (of course loading all the translations in ram at startup and referencing that would be better than running a query for each and every string).

select t.translation
  from textcontent tc
       join translations t on t.textcontentid = tc.textcontentid
 where tc.originaltext = ?
   and t.language = ?

The function could also return the originaltext and log an error message if a translation is not found.

BTW 1: most frameworks/languages have i18n facilities/libraries - you may investigate one and use it instead of rolling your own.

BTW 2: why would you put the translations in a database? what's the advantage compared to files?

[–] eager_eagle@lemmy.world 4 points 2 days ago* (last edited 2 days ago)

Instead you could have a function, say t("Ciao") that kinda runs something like (of course loading all the translations in ram at startup and referencing that would be better than running a query for each and every string).

this backfires when the same text translates to different strings depending on the context

e.g. EN "Play" may translate to PT "Jogar" (as in play a game) or "Reproduzir" (as in play a video)

[–] orsetto@lemmy.dbzer0.com 2 points 2 days ago* (last edited 2 days ago) (1 children)

Shouldn't that be TextContent(TextContentId, OriginalText)? Something like

(then you should make the id a primary key, index originaltext and make the id in the other table a foreign key)

Yup, TextContentId is PK, and I'd set to auto_increment so that i wouldn't have to worrya bout it.

Sure, but the you would have to reference the text via TextContentId in your code, which would be very annoying

Not really, the translations I keep in the database are for entity that would be retrueved from the client with already the translated values instead of ID (should have specified this in the post maybe)

BTW1/2: the translations on the database are for entities on the db (stuff like user generated content, which will also input the translations). For general text on the page I'm using a specific framework, which would be difficult to use for the user generated content, at least how it's set up now (it's angular's ngx-translate)

[–] talkingpumpkin@lemmy.world 1 points 2 days ago

the translations on the database are for entities on the db

Oh, then you could consider having one extra table per entity (one-to-many) with the translatable stuff:

create table some_entity (
    id .. primary key,
    -- fields for attributes that are not translated
    price ..,
    created_on ..,
    deleted_on ..,
    ..
);
create table some_entity_i18n(
    id .. primary key,
    some_entity_id .. foreign key references some_entity(id),
    locale ..,
    -- one field per translatable attribute
    title ..,
    description ..,
    ..
);

IMHO putting everything in one big table will only complicate things in the long run.