I’m experimenting, and like I mentioned before, I’ve hit a pretty hard hardware wall. My current limit is my GPU: an RTX 4090 with 24GB VRAM. That’s fine for running tests on small models, up to around 8B parameters. But the moment you try to scale, things change completely. For what I’m doing, you realistically need models in the 27–70B range at minimum.
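To make that ceiling concrete, here's a rough back-of-the-envelope sketch. It only counts the weights, plus the usual ~16 bytes/parameter rule of thumb for full mixed-precision Adam training; activations, KV cache, and framework overhead come on top, so treat these as lower bounds.

```python
# Rough VRAM lower bounds per model size. Weights only, plus the common
# ~16 bytes/param rule of thumb for full mixed-precision Adam fine-tuning.
# Activations, KV cache, and framework overhead are NOT included.

def gib(params_b, bytes_per_param):
    """Memory in GiB for a model with params_b billion parameters."""
    return params_b * 1e9 * bytes_per_param / 1024**3

for params_b in (8, 27, 70):
    print(f"{params_b:>3}B | bf16 weights ~{gib(params_b, 2):6.1f} GiB | "
          f"4-bit weights ~{gib(params_b, 0.5):5.1f} GiB | "
          f"full Adam fine-tune ~{gib(params_b, 16):7.1f} GiB")
```

Just holding 27B weights in bf16 is already more than double my 24GB, and even the 4-bit version barely leaves room for anything else. That's the wall in one printout.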
After about 50 hours, I managed to run a training test on Gemma 3 27B IT. Training itself was possible, but evaluation is another story. I simply can’t run proper evals because they take an insane amount of time. And without evaluation, it’s hard to clearly see what the model actually learned and where the differences are.
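To show why the eval side hurts so much, here's a toy wall-clock estimate. Every number in it is a placeholder, not something I measured; the point is just how quickly prompts × generated tokens ÷ throughput turns into days on a single card.

```python
# Toy estimate of eval wall-clock time. All three numbers are placeholders;
# plug in your own eval size and measured tokens/sec for the quantized model.
n_prompts = 5000        # prompts in the eval suite
gen_tokens = 256        # tokens generated per prompt
tokens_per_sec = 8.0    # assumed generation throughput on one GPU

hours = n_prompts * gen_tokens / tokens_per_sec / 3600
print(f"~{hours:.1f} hours for a single eval pass")   # ~44 hours with these numbers
```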
Models in the 1B–4B–8B range are fine for simple or generic tasks. But for very advanced, highly specific datasets like the ones I’m working with, they struggle a lot. They don’t really “get it”, and learning is shallow. For this kind of data, you need at least a 70B model, and honestly, based on the complexity, something closer to 400B would make much more sense.
To give you an idea: these datasets are so specific and advanced that they’re best used on models like ChatGPT 5+, Grok 4.1, Claude 4.5, and similar. Now imagine pushing that kind of data into an 8B model. That’s basically what I’m trying to test and understand.
So far, based on my current experiments, my private training method seems to work. The trained model ends up at roughly 95–98% of the original model’s overall capability. In some specific aspects and evaluations, it actually outperforms the base model by a noticeable margin.
I’m also trying to build a model from scratch, completely from zero. But realistically, I have to admit I don’t know how to do that yet. That means learning… a lot. For every single thing I want to do, I end up needing to learn ten other things first. That’s what eats most of my time right now.
At least for the moment, based on my own subjective analysis (and yeah, take that with a grain of salt), it feels like starting from scratch might actually give me better chances in the long run. Mainly because I wouldn't be constrained by existing architectures and their limitations.
I know how it sounds when I say I want to do all of this on a single 4090, while others burn millions of dollars just to train an 8B model. But step by step, things move forward. I’ll keep testing until either I find someone willing to offer a 20TB GPU cluster for 24 months… or 15 million dollars. Both are equally realistic lotteries.
And the risk is real on both sides. Not just for whoever would put up that kind of money, but also for me, potentially losing two years for nothing. Still, as the saying goes, you never really lose — you always learn something from it.
That’s where things stand right now.
Up until recently I was working mainly with Gemma 3 models. I’ve now switched to LLaMA 3.1 8B Instruct. It seems faster and more stable so far, and overall it behaves better. We’ll see how it holds up.
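For reference, the generic way to get an 8B model training on a single 24GB card is a 4-bit load plus LoRA adapters via transformers/peft/bitsandbytes. The sketch below is that public baseline, not my private method, and it assumes you have access to the meta-llama/Llama-3.1-8B-Instruct checkpoint on Hugging Face.

```python
# Generic single-GPU baseline: 4-bit load + LoRA adapters (QLoRA-style).
# This is a public, standard setup, NOT the private training method above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # gated repo, requires HF access

# 4-bit NF4 quantization keeps the 8B weights well under 24GB and leaves
# headroom for activations and longer sequences.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Small LoRA adapters on the attention projections; only these get trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```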
From the first test with LLaMA, I managed to bring the fine-tuned model to about 95% overall compared to the original base model. Now I need to see how it performs in the rest of the evaluations.
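For clarity, a figure like that can be read as an average over eval scores relative to the base model's average. The toy comparison below shows what I mean; the task names and scores are placeholders, not my actual results.

```python
# Toy version of the "X% of the base model" comparison. The task names and
# scores are placeholders, NOT real eval results.
base_scores = {"task_a": 0.62, "task_b": 0.71, "task_c": 0.55}
tuned_scores = {"task_a": 0.60, "task_b": 0.70, "task_c": 0.52}

base_avg = sum(base_scores.values()) / len(base_scores)
tuned_avg = sum(tuned_scores.values()) / len(tuned_scores)

print(f"base avg:  {base_avg:.3f}")
print(f"tuned avg: {tuned_avg:.3f}")
print(f"relative:  {100 * tuned_avg / base_avg:.1f}% of base")
```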
More to come.