this post was submitted on 30 Nov 2023
303 points (99.0% liked)

[–] IHeartBadCode@kbin.social 168 points 2 years ago (9 children)

Well, the issue at hand is that, as on the x86 arch, this is getting to the point where you cannot just move the NR_CPUS value upward and call it done. The kernel needs to keep some information on hand about the CPUs, usually about 8KB per CPU. That is usually allocated on the stack, a bit of special memory that comes with some assurances, like being contiguous and having things automagically deallocated for you when they go out of scope.

However, because of those special assurances, simply increasing the size of the stack can create all kinds of issues, namely cache and TLB misses. One of the things that makes CPUs go faster is moving bits of RAM into some special RAM inside the CPU called cache (there are different levels of cache and each level has different properties, which is getting a bit too deep into details). The CPU attempts to guess the next bit of RAM that needs to move into cache before it's actually needed; this is called prefetching. Usually the CPU gets it right, but sometimes it gets it wrong and must tell the core to wait while it goes and gets the correct bit of RAM. The cores move way faster than the transfer of RAM to the cache, which is why the CPU needs to move the bits from RAM into cache before the core actually needs them.

So keeping the stack small pretty much ensures that the stack fits into one of the levels of cache on the CPU, which keeps it fast while still having all that neat automagical stuff like deallocation when things go out of scope. So you cannot just increase the NR_CPUS value, because the stack will get too large to fit nicely inside the cache. It'll get broken up into "pages", with the current page in cache and the others still in RAM, and swapping between the pages can introduce TLB misses.

So the patch being submitted will set the CPUMASK_OFFSTACK flag for particular configurations. This moves the CPU information that's being maintained off of the stack, to be allocated with slab allocation instead. Slab allocation is a kernel allocation algorithm that's a bit different from the usual C-style malloc or calloc. (For any C programmers out there: reach for calloc first, and use malloc only if you have a reason to. calloc should be your go-to for security reasons, but I don't want to paper over details by just saying use calloc and never malloc. There's a difference, and that difference is important in some cases.)
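To make the on-stack versus off-stack idea a bit more concrete, here is a rough userspace sketch in C. It is not kernel code: every `demo_*` name and both `DEMO_*` constants are made up for illustration, the 8KB figure is simply the one quoted above, and `calloc` stands in for the kernel's own allocator. In the kernel itself the real interfaces for this are `cpumask_var_t`, `alloc_cpumask_var()` and `free_cpumask_var()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Both values are assumptions for the demo: 8 KB per CPU is the figure
 * quoted in the comment, and 512 is just an example CPU count. */
#define DEMO_BYTES_PER_CPU 8192
#define DEMO_NR_CPUS 512

/* Hypothetical stand-in for whatever is tracked per CPU. */
struct demo_cpu_info {
    unsigned char data[DEMO_BYTES_PER_CPU];
};

/* Off-stack style: the caller's stack frame holds only a pointer.
 * calloc returns zero-filled memory, so no stale bytes leak through. */
struct demo_cpu_info *demo_alloc_table(size_t nr_cpus)
{
    return calloc(nr_cpus, sizeof(struct demo_cpu_info));
}

/* Unlike a stack variable, heap memory must be released explicitly. */
void demo_free_table(struct demo_cpu_info *table)
{
    free(table);
}

/* Returns 1 if every byte of the table is zero, 0 otherwise. */
int demo_table_is_zeroed(const struct demo_cpu_info *table, size_t nr_cpus)
{
    assert(table != NULL);
    for (size_t i = 0; i < nr_cpus; i++)
        for (size_t j = 0; j < DEMO_BYTES_PER_CPU; j++)
            if (table[i].data[j] != 0)
                return 0;
    return 1;
}
```

The on-stack equivalent would be declaring `struct demo_cpu_info table[DEMO_NR_CPUS];` inside a function, which asks for the whole table in that function's stack frame. With the off-stack version the frame only carries a pointer, at the cost of having to check for a NULL return and call `demo_free_table()` yourself.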

Without deep diving into kernel slabs, they're a bit different in that they don't have some of those nice automagical things that come with stack memory, so one must be a bit more careful with how they are used. The nice thing about the slab allocator, though, is that it's pretty smart about ensuring it's doing the right thing. This is for the 5.3 kernel, but I love the charts that give an overview of how the slab allocator works. It's pretty similar in 6.x kernels, but I don't have any nifty charts for that version; if someone does, I will love you if you post a link.

That said, it's a bit slower, but a fair enough tradeoff until there's some change in the ARM Cortex-X memory cache arrangement. Going from memory, I think Cortex-X4 has 32MB of shared L3 cache, so at 8KB per CPU with the 8192 CPU max you'd need 64MB just to hold the CPU bitmap, and that's in L3, which is slow compared to the other levels. And there's other stuff you're going to need in the cache at any given time, so hogging it all is not ideal. Setting the limit for stack usage to 512 is good, as that means the bitmap is just 4MB and you can schedule well ahead of time when to move it all into cache or leave it in RAM. (The kernel can do all kinds of special stuff with the prefetcher to indicate when a bit of RAM needs to be moved into cache; us measly users can only make a suggestion, called a hint, to the prefetcher.) So it's a good balance for the moment.
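The arithmetic behind those 64MB and 4MB numbers can be sketched in a few lines of C. This just takes the 8KB-per-CPU figure and the from-memory 32MB L3 size above at face value; the `demo_*` helper names are hypothetical and nothing here is measured or taken from the kernel source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Figure quoted in the comment above; an assumption, not a measured value. */
#define DEMO_BYTES_PER_CPU ((size_t)8 * 1024)

/* Total bytes needed to hold the per-CPU data for nr_cpus CPUs. */
size_t demo_table_bytes(size_t nr_cpus)
{
    /* Guard against the multiplication wrapping around. */
    assert(nr_cpus <= SIZE_MAX / DEMO_BYTES_PER_CPU);
    return nr_cpus * DEMO_BYTES_PER_CPU;
}

/* Convert bytes to whole mebibytes (1 MiB = 1024 * 1024 bytes). */
size_t demo_bytes_to_mib(size_t bytes)
{
    return bytes / ((size_t)1024 * 1024);
}

/* Nonzero if the table alone is larger than a cache of cache_mib MiB. */
int demo_exceeds_cache(size_t nr_cpus, size_t cache_mib)
{
    return demo_bytes_to_mib(demo_table_bytes(nr_cpus)) > cache_mib;
}
```

Under those assumptions, `demo_table_bytes(8192)` works out to 64 MiB and `demo_table_bytes(512)` to 4 MiB, so only the 8192 case is bigger than a 32MB L3 all by itself.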

But server-style ARM is making headway, so it makes sense to handle it the same way the kernel handles server-style x86 and other server-style archs like POWER and whatnot, without messing too much with consumer-style ARM, which hardly needs these massive bitmaps.

[–] Speculater@lemmy.world 59 points 2 years ago (4 children)

Did you just type this all off the top of your head?! Great work, but who keeps all this specific knowledge at the ready? I'm impressed.

[–] registrert@lemmy.sambands.net 17 points 2 years ago (1 children)

I'm sure you're really knowledgeable about something particularly obscure; your time to shine will come!

[–] Speculater@lemmy.world 7 points 2 years ago

Maybe lock picking, hashcat, or physics, but damn was this an impressive comment.
