So, a small update on what I’ve been doing lately

AlexH

I’m experimenting, and like I mentioned before, I’ve hit a pretty hard hardware wall. My current limit is my GPU: an RTX 4090 with 24GB VRAM. That’s fine for running tests on small models, up to around 8B parameters. But the moment you try to scale, things change completely. For what I’m doing, you realistically need models in the 27–70B range at minimum.
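
To make that wall concrete, here's a rough back-of-envelope memory estimate. The bytes-per-parameter figures are generic rules of thumb (assumptions), not measurements from my setup, and activations or longer contexts can shift the totals by several GB.

```python
# Rough VRAM estimate for fine-tuning. The bytes-per-parameter values are
# generic rules of thumb (assumptions), not measurements from this project.

def estimate_vram_gb(params_b: float, mode: str = "qlora") -> float:
    """Very rough GPU memory estimate (GB) for params_b billions of parameters."""
    params = params_b * 1e9
    if mode == "full_bf16_adam":
        # bf16 weights (2) + bf16 grads (2) + fp32 master copy and Adam moments (~12)
        bytes_per_param = 16
    elif mode == "qlora":
        # ~0.6 bytes/param for a 4-bit NF4 base model; the LoRA adapters and
        # their optimizer state are comparatively tiny.
        bytes_per_param = 0.6
    else:
        raise ValueError(mode)
    overhead_gb = 2.0  # CUDA context plus activations at a small batch size
    return params * bytes_per_param / 1e9 + overhead_gb

for size in (8, 27, 70):
    print(f"{size:>2}B   full FT: ~{estimate_vram_gb(size, 'full_bf16_adam'):5.0f} GB   "
          f"QLoRA: ~{estimate_vram_gb(size, 'qlora'):4.1f} GB")
```

On 24 GB the picture is roughly: 8B is comfortable, 27B is borderline and slow, and 70B doesn't fit at all without offloading or more GPUs.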

After about 50 hours, I managed to run a training test on Gemma 3 27B IT. Training itself was possible, but evaluation is another story. I simply can’t run proper evals because they take an insane amount of time. And without evaluation, it’s hard to clearly see what the model actually learned and where the differences are.
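
For context on how a 27B model can be trained on 24 GB at all: the standard public recipe is 4-bit quantization plus LoRA adapters (QLoRA). The sketch below is just that generic recipe, not my private method; the model id, rank, and target modules are illustrative assumptions.

```python
# Generic QLoRA setup for a 27B-class model on a single 24 GB GPU.
# This is NOT the private training method discussed here, only the standard
# public baseline; model id and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "google/gemma-3-27b-it"  # assumed checkpoint; multimodal Gemma 3
                                    # variants may need a different Auto class

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_cfg,
    device_map="auto",  # spill layers to CPU RAM if the GPU runs out of room
)

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the small LoRA adapters get trained
```

Even with a setup like this, multi-day runs and painfully slow evals are about what you should expect at this size on a single card.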

Models in the 1B to 8B range are fine for simple or generic tasks. But with very advanced, highly specific datasets like the ones I’m working with, they struggle a lot. They don’t really “get it”, and the learning stays shallow. For this kind of data, you need at least a 70B model, and honestly, given the complexity, something closer to 400B would make much more sense.

To give you an idea: these datasets are so specific and advanced that they’re best used on models like ChatGPT 5+, Grok 4.1, Claude 4.5, and similar. Now imagine pushing that kind of data into an 8B model. That’s basically what I’m trying to test and understand.

So far, based on my current experiments, my private training method seems to work. The trained model ends up at roughly 95–98% of the original model’s overall capability. In some specific aspects and evaluations, it actually outperforms the base model by a noticeable margin.

I’m also trying to build a model completely from scratch. But realistically, I have to admit I don’t know how to do that yet. That means learning… a lot. For every single thing I want to do, I end up needing to learn ten other things first. That’s what eats most of my time right now.

At least for the moment, based on my own subjective analysis (and yeah, take that with a grain of salt 😄), it feels like starting from scratch might actually give me better chances in the long run. Mainly because I wouldn’t be constrained by existing architectures and their limitations.

I know how it sounds when I say I want to do all of this on a single 4090, while others burn millions of dollars just to train an 8B model. But step by step, things move forward. I’ll keep testing until either I find someone willing to offer a 20TB GPU cluster for 24 months… or 15 million dollars. Both are equally realistic lotteries.

And the risk is real on both sides. Not just for whoever would put up that kind of money, but also for me, potentially losing two years for nothing. Still, as the saying goes, you never really lose — you always learn something from it.

That’s where things stand right now.

Up until recently I was working mainly with Gemma 3 models. I’ve now switched to LLaMA 3.1 8B Instruct. It seems faster and more stable so far, and overall it behaves better. We’ll see how it holds up.

From the first test with LLaMA, I managed to bring the fine-tuned model to about 95% overall compared to the original base model. Now I need to see how it performs in the rest of the evaluations.
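
As a side note on what “about 95% overall” means in practice: one simple way to express it (not necessarily exactly how my pipeline computes it) is to run the same benchmark suite, for example with lm-evaluation-harness, on both checkpoints and average the per-task ratios. The numbers below are placeholders just to show the calculation, not real results.

```python
# Placeholder illustration of an "overall % of base model" figure.
# These scores are made-up numbers, not actual results from my runs.
base_scores  = {"mmlu": 0.68, "gsm8k": 0.76, "hellaswag": 0.80}  # base model (placeholders)
tuned_scores = {"mmlu": 0.65, "gsm8k": 0.72, "hellaswag": 0.77}  # fine-tuned checkpoint (placeholders)

ratios = {task: tuned_scores[task] / base_scores[task] for task in base_scores}
overall = sum(ratios.values()) / len(ratios)

for task, r in ratios.items():
    print(f"{task:10s} retention: {r:.1%}")
print(f"overall    retention: {overall:.1%}")  # ~95% corresponds to the figure above
```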

More to come.
 
Technical Update: Methodology Wins, Scaling Challenges, and the "25% Breakthrough"

Project Status: Active / Iteration Phase
Model: Llama 3.1 8B Instruct (Custom STO Adapter)
Current Focus: Full Dataset Integration & Logic Stabilization

1. The "Grind" Behind the Progress
I know things have been moving at a steady, albeit measured, pace. I want to address the "why" behind the timeline. We are currently operating under two main pressures: hardware limitations and a steep daily learning curve. Every day, I am diving deeper into the architecture, fixing code, and optimizing pipelines. For me, this project is a marathon of learning. Navigating hardware bottlenecks isn't just a delay; it’s a forcing function that makes me write more efficient code and develop smarter training strategies.

2. Identifying the "25% Bottleneck"
We recently hit a significant milestone by identifying a critical bug in the training pipeline. For several previous runs, the model was only "seeing" roughly 25% of the intended 20 million token dataset.

While that sounds like a setback, it’s actually a strong validation. If my previous checkpoints were outperforming the base Llama 3.1 model while using only a fraction of the data, it shows that the STO (Specialized Task Optimization) method and the underlying data quality are exceptionally strong. We are now moving toward a full 100% utilization run to close those final knowledge gaps.
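
For anyone running into something similar: a cheap way to catch this class of bug is to count the tokens the training loop actually receives in one pass and compare that with the token count of the prepared dataset. The sketch below uses generic names, not my real pipeline.

```python
# Sanity check for "the model only saw X% of the data" bugs: count the
# non-padding tokens actually streamed to the model in one epoch and compare
# with the dataset's expected token count. Names are generic placeholders.
from torch.utils.data import DataLoader

def count_streamed_tokens(loader: DataLoader, pad_token_id: int) -> int:
    """Tokens the model actually 'sees' in one epoch (padding excluded)."""
    seen = 0
    for batch in loader:                  # expects dict batches with "input_ids"
        input_ids = batch["input_ids"]    # shape: (batch, seq_len)
        seen += int((input_ids != pad_token_id).sum())
    return seen

# seen = count_streamed_tokens(train_loader, tokenizer.pad_token_id)
# coverage = seen / expected_tokens      # expected_tokens ~ 20_000_000 here
# Coverage near 0.25 instead of ~1.0 is exactly the symptom above; the usual
# suspects are over-aggressive truncation, a too-small max_steps, or a
# sampler/sharding bug that silently drops most of the data.
```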

3. The Methodology: Synthetic Data meets STO
The synergy between my synthetic data generation and the private STO method is proving to be the project's cornerstone.

  • The synthetic data provides the "expert" knowledge base.
  • The STO method ensures that Llama 3.1 stays "sane"—retaining its core reasoning and instruction-following capabilities without succumbing to catastrophic forgetting.
Preliminary tests from current checkpoints show that even at 0.5 epochs with the full dataset, the model is picking up complex nuances in Machine Learning and Humanities that the base model simply doesn't possess.
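
Since STO itself stays private, here is only a generic illustration of one public technique that targets the same "stays sane" goal: replay-style data mixing, where a slice of general instruction data is interleaved with the specialized synthetic set to reduce catastrophic forgetting. The dataset paths and the 85/15 ratio are assumptions for the example.

```python
# Generic replay-style data mixing (NOT the STO method): interleave general
# instruction data with the specialized synthetic data so the model keeps its
# base capabilities. Paths and the mixing ratio are illustrative assumptions.
from datasets import load_dataset, interleave_datasets

specialist = load_dataset("json", data_files="synthetic_expert.jsonl", split="train")   # placeholder
generalist = load_dataset("json", data_files="general_instruct.jsonl", split="train")   # placeholder

mixed = interleave_datasets(
    [specialist, generalist],
    probabilities=[0.85, 0.15],   # mostly expert data, with a small "replay" slice
    seed=42,
    stopping_strategy="all_exhausted",
)
print(mixed)
```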

4. Scaling the Vision: The Path to 500M Tokens
To put things in perspective, the current 20M token dataset is what I would call a "Grade 6" on a quality scale of 1-10. It’s solid, but I have higher-tier datasets in the works.

I am currently architecting a 500 Million token dataset composed of what I call "Grade 20" data—information so dense and high-quality it’s off the charts. If a "Grade 6" at 25% capacity can beat the benchmarks, the full-scale implementation of the 500M set will be a paradigm shift for what an 8B model can achieve.
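
To give a rough sense of what that jump means in compute on a single card, here is some back-of-envelope arithmetic. The throughput number is an assumption, not a measurement from my runs; plug in your own tokens/sec to get a realistic estimate.

```python
# Back-of-envelope compute scaling from 20M to 500M training tokens.
# The throughput figure is an assumption, not a measured value from this project.
tokens_per_sec = 2_500   # assumed single-GPU fine-tuning throughput for an 8B model
epochs = 2

for dataset_tokens in (20_000_000, 500_000_000):
    hours = dataset_tokens * epochs / tokens_per_sec / 3600
    print(f"{dataset_tokens/1e6:5.0f}M tokens x {epochs} epochs -> ~{hours:6.1f} GPU-hours")
# Scaling the data 25x scales the compute 25x, which is where a single 4090
# really starts to hurt.
```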

5. Looking Ahead
We still have a long road to travel. The hardware is at its limit, and the learning never stops. But the results don't lie: we are on the right path. The next major update will feature the results from the full 2-epoch run on the 20M set using the fixed data loader.

Thank you for the support and the patience. We’re not just training a model; we’re refining a new way to specialize small-parameter architectures.

Stay tuned for the next log.
 