Stable Diffusion

4977 readers

9 users here now

Discuss matters related to our favourite AI Art generation technology

Also see

Stable Diffusion Art (See its sidebar for more GenAI Art comms)
!aihorde@lemmy.dbzer0.com

Other communities

founded 2 years ago

MODERATORS

db0@lemmy.dbzer0.com

Even_Adder@lemmy.dbzer0.com

Is it OK to ask for help here? Getting lots of "loss: NaN" when training on Automatic1111. All training files come out garbage. (lemmy.starlightkel.xyz)

submitted 2 years ago by thebardingreen@lemmy.starlightkel.xyz to c/stable_diffusion@lemmy.dbzer0.com

5 comments fedilink hide all child comments

Casual hobbyist, not an expert here.

It WAS working... About eight months ago, I trained a bunch of embeddings and hypernetworks and it all worked great.

Cut to the present, I want to do some more training. I've updated Automatic1111 several times, but nothing else about my setup has changed. However, whenever I try to train anything (embeddings, hypernetworks or loras), loss is NaN for 4 out of 5 steps right from the get go. As the training progresses, loss becomes NaN for 9 out of 10 steps, then 19 out of 20 steps around step 3,000, which is as far as I've gotten. Hypernetworks just don't work at that point and embeddings produce garbage.

I have googled like crazy, and found

A few threads, where the best hint is that (at least 8-9 months ago) xformers broke training. Well, I've messed around with xformers, uninstalled and reinstalled xformers, eaten xformers for breakfast. Behavior is the same.

Lower training rate I have set my training rate to 0.0000000000000005. Behavior is identical.

My system is on the low end for VRAM (8G). I have TWO 8G cards, so I wish I could train on both like I can for Llama. But I also think that's not it, because my OLD embeddings and hypernetworks came out great and still work.

Any thoughts here?

top 5 comments

sorted by: hot top controversial new old

[–] Hubi@feddit.de 5 points 2 years ago* (last edited 2 years ago)

I've only had this happen when I messed something up in my training settings. Keep in mind that a very slow learning rate can cause this as well. That said, maybe try training with kohya_ss, it's much simpler and less resource-heavy than the training extension in Auto1111.

[–] BrianTheeBiscuiteer@lemmy.world 4 points 2 years ago (1 children)

Depending on how the dependencies (i.e. xformers) are versioned you can make a new clone of A1111 and checkout a commit from ~8 months ago to see if it works again. Of course I would also recommend trying a fresh install.

[–] thebardingreen@lemmy.starlightkel.xyz 1 points 2 years ago* (last edited 2 years ago)

Of course I would also recommend trying a fresh install.

Way ahead of you there. I've reinstalled the current version four or five times at this point.

make a new clone of A1111 and checkout a commit from ~8 months ago

This is a good idea. I've tried two different old versions from old commit hashes so far and both have crashed with other problems. It seems like (lol) both versions of A1111 put their venv in the same place, so the old versions are barfing on some dependencies with version numbers that are too high and they ALSO broke my current version by downgrading some other dependencies (easy fix, just wipe it out and reinstall it again). I'm trying to debug this, because I COULD see a world where I have an old version of A1111 training on one card while the NEW version generates on the other.

[–] Even_Adder@lemmy.dbzer0.com 3 points 2 years ago

I don't know much about training, but maybe these can help.

https://rentry.org/59xed3

https://github.com/derrian-distro/LoRA_Easy_Training_Scripts

[–] scrubbles@poptalk.scrubbles.tech 2 points 2 years ago

Check out the readme and the advanced parameters. Iirc there's literally a checkbox for like half vae or something that says "check this if you're getting NaN errors"