As a Norwegian:
- 100 is boiling
- 40 is we all gonna die
- 30 is hot
- 20 is a little warm
- 10 is a little cool
- 0 is cold
- -5 is maybe time for a jacket
- -10 is shit, it's freezing cold outside!
- -15 is I'll stay indoors if I can
"I had the misfortune to come across a leaked video of your CEO having some really questionable sexual intercourse with a really sketchy character, and it was truly disgusting. I cannot in good conscience support a company led by such a horrible individual."
If they want feedback, give them feedback.
My #1 tip: Get familiar with Docker and running Docker containers. It takes a bit of getting used to at the start, but it's well worth it.
Second, put the services behind a reverse proxy like Caddy or Traefik.
Third, if you don't want to expose your IP address directly, I've heard good things about Cloudflare tunnels.
I need to be able to access it from both phone and PC, and in emergency from a random PC/phone. Also, I used the "share as web link" a lot in notion.
In addition it would be nice to be able to access it at work without having to install anything.
AppFlowy isn't web based? I couldn't find an option to just run a server and access it via browser anywhere in the install instructions.
You know what, I don't have a good answer for you here. I did a few small experiments on ChatGPT and it seems like it has some knowledge of whether it will be able to complete it or not. This was with a pretty well known question though.
I tried to recreate an earlier experiment where I asked it to write about a friend of mine, who was in the news some time ago and apparently has a few entries in its training data, but very little. ChatGPT would then consistently hallucinate facts about the person, including date of birth and sometimes date of death. In that case it knew the pattern of writing about a person, including date of birth and sometimes date of death, but it didn't know that it didn't have that info, and just filled in plausible looking data there. Now it insists on not knowing who that person is at all and refuses to write anything about him.
Anyway, you've given me some things to think about, thanks.
My experience too. The few times I've been stuck and decided to try ChatGPT, it's been completely unhelpful, at best suggesting basic things that I checked within the first 5 minutes of troubleshooting.
That was the best case. Worst case it'd spout some plausible looking nonsense that took time to check and dismiss.
With the occasional hostile takeover
First of all, this link is just to C# bindings of llama.cpp and so doesn’t contain the actual implementation.
I know, it's my code. I refactored it from some much less readable and usable c# code. I picked it because it more clearly shows the steps involved in generating text.
How do you know LLMs can’t look ahead? [...] How do you know it hasn’t written out the entire response in memory already after which it only shows you the first word?
Firstly, it goes against everything we know so far of how they operate, and secondly.. because they can't.
If you look at the C# code, the first step is in the _process_tokens function, where it feeds the context into llama_eval. That goes through each token and updates the internal memory / model state. Since it saves state, if you have already processed some of the tokens you can tell it to skip them and start on the new ones.
After this function you have a state in memory, the current state of the LLM, as a result of the tokens it's seen so far.
When we are done with that, we go to the more interesting part, the _predict_next_token function. Note that it takes a samplingparams parameter. It then sets some options: if top k is not set, it's set to the length of the model's vocabulary (the number of tokens it knows about), and repeat_last_n, if not set, is set to the length of the existing context.
The code then gets the model's vocabulary, aka all the tokens it knows about, and then it generates the logits. The logits are an array the length of the vocabulary, with a number for each token showing how likely that one is to be the next token. The code then adds any specified token bias to that token's number. Already here, even if it had a specific answer in mind, you can see problems starting.
Then the code adds the token repetition penalty, based on the samplingparams. This means that if a token repeats inside the given history, its value will be lowered according to the repeat_penalty. Again, even if it had a specific answer, this has a high chance of messing that up. The same is done for frequency and presence. For more details of what those native functions do, you can see the llama.cpp source - they have the same name there.
After all the penalties are applied, it's time to pick the token. If the temp is 0 or lower, it just picks the highest rated token (aka greedy sampling). This tends to give very boring and flat responses, but it's predictable and reproducible, so it's often used in benchmarks of various kinds.
But if that's not used (which it almost never is in "real" use), there are several methods. You have MiroStat, which tries to create more consistent quality between different answer lengths, and the "traditional" using top-k, top-p and temperature.
Common to all of them, however, is that internally they produce a top list of candidates, and then pick one at random. And that's why an LLM can't plan ahead.
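To make the steps above concrete, here is a minimal Python sketch of the "traditional" path (bias, repetition penalty, then top-k / top-p / temperature). It is not the C# code or llama.cpp itself: the function name, the default values and the exact order of the filters are my assumptions for illustration, and the frequency/presence penalties and MiroStat are left out.

```python
import math
import random


def sample_next_token(logits, history, *, logit_bias=None, repeat_penalty=1.1,
                      temperature=0.8, top_k=40, top_p=0.95, rng=random):
    """Pick one token ID from raw logits. Illustrative sketch only."""
    logits = list(logits)

    # 1. Add any user-specified bias to individual tokens.
    for token, bias in (logit_bias or {}).items():
        logits[token] += bias

    # 2. Repetition penalty: lower the score of tokens already in the history.
    for token in set(history):
        if logits[token] > 0:
            logits[token] /= repeat_penalty
        else:
            logits[token] *= repeat_penalty

    # 3. Temperature 0 or lower: greedy sampling, just take the best token.
    if temperature <= 0:
        return max(range(len(logits)), key=lambda t: logits[t])

    # 4. Top-k: keep the k best candidates (whole vocabulary if not set).
    k = top_k if top_k and top_k > 0 else len(logits)
    ranked = sorted(range(len(logits)), key=lambda t: logits[t], reverse=True)[:k]

    # 5. Top-p: keep the smallest prefix whose probability mass reaches top_p.
    best = logits[ranked[0]]
    weights = [math.exp(logits[t] - best) for t in ranked]
    total = sum(weights)
    kept, cumulative = [], 0.0
    for token, weight in zip(ranked, weights):
        kept.append(token)
        cumulative += weight / total
        if cumulative >= top_p:
            break

    # 6. Temperature reshapes the remaining candidates, then one is drawn at random.
    final = [math.exp((logits[t] - best) / temperature) for t in kept]
    return rng.choices(kept, weights=final, k=1)[0]
```

Note that everything from step 1 onwards happens outside the model. With `logits=[1.0, 3.0, 2.0]` and greedy sampling the pick is token 1, but put token 1 in the history with `repeat_penalty=2.0` and the pick becomes token 2, without the model being involved at all.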
When a token ID is eventually produced, it is returned and added to the context. The text equivalent of the token is looked up and sent back to the UI, and then the new context is fed into llama_eval and the process starts again.
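The loop itself is tiny. Here is a rough Python sketch of it with a toy stand-in for the model: `toy_eval`, `pick_token` and the five-word vocabulary are made up for the example and have nothing to do with the real llama_eval, they are only there so the loop can run on its own.

```python
import random

# Hypothetical toy vocabulary; index 0 doubles as the end-of-text token.
VOCAB = ["<eos>", "the", "cat", "sat", "down"]


def toy_eval(context):
    """Stand-in for the model: one score per vocabulary token, for the NEXT token only."""
    favourite = (context[-1] + 1) % len(VOCAB)
    return [1.0 if token == favourite else 0.2 for token in range(len(VOCAB))]


def pick_token(scores, rng):
    """Stand-in for the sampler: a weighted random draw over the scores."""
    return rng.choices(range(len(scores)), weights=scores, k=1)[0]


def generate(prompt_tokens, max_new_tokens=8, seed=None):
    """Yield the text of one token at a time, the way a UI would receive it."""
    rng = random.Random(seed)
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = toy_eval(context)       # evaluate the context so far
        token = pick_token(scores, rng)  # the pick happens outside the model
        if token == 0:                   # end-of-text: stop generating
            break
        context.append(token)            # the new token becomes part of the context
        yield VOCAB[token]               # look up its text and hand it to the caller


print(" ".join(generate([1], seed=1)))
```

The only thing carried from one round to the next is the context (plus, in the real thing, the cached model state for it). There is no place in the loop where a finished answer is stored.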
For the LLM to even be able to plan an answer ahead it must know of all penalties and parameters (or have none applied), and greedy token prediction must be used.
And that is why, even if it had some sort of near magical ability to plan ahead that we just don't know is there, at the end of the day it could still not plan a specific response.
Just ask if you want some clarification.
As for GPU, I'm waiting.. IMHO it's just too expensive now. And sadly, Nvidia is currently the only game in town. Some software works on AMD, but just about everything works on Nvidia.
That said, my PC has 48 GB of system RAM, and I can run 65B models on it at about 1s per token, with a few layers offloaded to my 10 GB GPU. That would otherwise require 2x 3090 or 4090 (2x 4090 would be about 20x faster though..)
No reason? It's probably meetings, then more meetings, add some meetings, and you guessed it, meetings.
Like the follow tab mentioned it's probably first product owners meetings to agree on what a user would expect.. and there's always someone having a wild opinion or two that needs to be "hashed out". Then when that's done it's meetings with the UX team, then they have a meeting on their own, then a new meeting with product owner, UX and designers, then after that frontend team is in the loop, then back to UX and prod owner, then a new round, then it's time for backend to come in, first one with PO and frontend, then a technical one to agree on how to do it, then database team is involved, they refuse to change a small thing and expected functionality needs to be changed, back to PO, UX, frontend, backend, and then finally maybe a dev or two can sit down and add it. Which takes 2 hours. After six weeks of meetings.
And then comes testing of course, and signoff on the functionality.
"Fast" is nowhere in enterprise development.
Just remember to apply opposite logic to anything Republicans say, and it's clearer.