• 0 Posts
  • 32 Comments
Joined 1 year ago
Cake day: March 22nd, 2024



  • They aren’t specialized though!

    There are a lot of directions “AI” could go:

    • Is autoregressive bitnet going to take off? In that case, the compute becomes extremely light, and the thing to optimize for is memory bandwidth and cache.
    • Or diffusion or something with fewer passes like that? In that case, we go the opposite direction, throw bandwidth out the window, and optimize for matmul compute.
    • What if it’s both? In that case, one wants a truckload of ternary adders and not too much else (there’s a sketch below of why ternary weights turn matmuls into pure adds).
    • Or what if some other form of sparsity takes over? Given the effectiveness of quantization and MoE, there’s clearly a ton of sparsity to take advantage of. Nvidia already bet on this (2:4 structured sparsity in their tensor cores), but it hasn’t taken off yet.

    There are all sorts of wild directions the sector could go. The flexibility of a swappable ASIC die would be a huge benefit for AMD: they don’t have to ‘commit’ to any particular direction the way Nvidia’s monolithic dies do. If a new trend takes off, they can take an existing die and swap out the ASIC relatively quickly, without taping out a whole new GPU.
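
    To make the ternary-adder point concrete, here’s a minimal sketch (plain NumPy, purely illustrative) of why a bitnet-style layer with weights restricted to {-1, 0, +1} needs no multipliers: the matmul collapses into additions and subtractions of activations, and the zeros come along as free sparsity.

    ```python
    import numpy as np

    def ternary_matmul(x, W_ternary):
        """Multiply activations x (n,) by a ternary weight matrix W (m, n)
        whose entries are only -1, 0, or +1 -- no multiplications needed."""
        out = np.zeros(W_ternary.shape[0], dtype=x.dtype)
        for i, row in enumerate(W_ternary):
            # Add where the weight is +1, subtract where it is -1,
            # and skip zeros entirely (that's the exploitable sparsity).
            out[i] = x[row == 1].sum() - x[row == -1].sum()
        return out

    # Tiny check: matches the ordinary dense matmul.
    rng = np.random.default_rng(0)
    x = rng.standard_normal(8).astype(np.float32)
    W = rng.integers(-1, 2, size=(4, 8)).astype(np.float32)
    assert np.allclose(ternary_matmul(x, W), W @ x, atol=1e-5)
    ```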




  • Oh, and one more thing: I saw you mention context management.

    Mistral (24B) models are really bad at long context, but that’s not true across the board. I find that Qwen 32B and Gemma 27B are solid at 32K (which is a huge body of text), and with the right backend settings you can easily run either at 64K with very minimal VRAM overhead.

    Specifically, run Gemma with the latest llama.cpp server commit (it will automatically use sliding window attention as of, like, yesterday), or run Qwen (and most other models) with exllamav2 or exllamav3, which quantize the KV cache down to Q4 very efficiently.

    This way you don’t need to manage context: you can feed the LLM the whole adventure so it doesn’t forget anything, and streamed responses will be instant since the prompt is always cached.
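
    As a rough illustration of the ‘just resend the whole adventure and let the cache do the work’ idea, here’s a minimal Python sketch against a locally running llama.cpp server. The /completion endpoint, cache_prompt flag, and port are assumptions from memory; adjust them to whatever your backend actually exposes.

    ```python
    import requests

    LLAMA_SERVER = "http://127.0.0.1:8080"  # assumed local llama.cpp server

    def continue_adventure(full_log: str, player_turn: str) -> str:
        """Send the entire adventure so far plus the new turn.
        With prompt caching, only the new tokens need processing."""
        resp = requests.post(
            f"{LLAMA_SERVER}/completion",
            json={
                "prompt": full_log + "\nPlayer: " + player_turn + "\nGM:",
                "n_predict": 256,
                "cache_prompt": True,  # reuse the KV cache for the shared prefix
                "temperature": 0.7,
            },
            timeout=300,
        )
        resp.raise_for_status()
        return resp.json()["content"]

    # Usage: keep appending to one growing log instead of trimming context.
    log = "You stand at the gates of the ruined keep."
    reply = continue_adventure(log, "I push the gates open.")
    log += "\nPlayer: I push the gates open.\nGM:" + reply
    ```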



  • Another suggestion: be careful with your sampling. Use a low temperature and a high MinP for queries involving rules, and a higher temperature (plus samplers like DRY) when you’re trying to tease out interesting ideas.

    I would even suggest an alternative frontend like mikupad that exposes token probabilities, so you can go to any point in the reply and look through every “idea” the LLM had internally (and regenerate from that point if you wish). It’s also good for debugging sampling issues when you get an incorrect answer, as sometimes the LLM actually had the right token but bad sampling parameters picked a wrong one.
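
    If it helps, here’s roughly what those two ‘modes’ look like as sampler presets. This is a sketch assuming llama.cpp-style parameter names (temperature, min_p, dry_multiplier); other backends spell these differently.

    ```python
    # Two sampler presets: strict for rules lookups, loose for brainstorming.

    RULES_PRESET = {
        "temperature": 0.3,  # low temperature: stick to the most likely tokens
        "min_p": 0.2,        # high MinP: drop everything far below the top choice
    }

    BRAINSTORM_PRESET = {
        "temperature": 1.1,     # higher temperature: let weaker "ideas" through
        "min_p": 0.05,          # looser MinP floor
        "dry_multiplier": 0.8,  # DRY: penalize verbatim repetition
    }

    def sampling_for(query_kind: str) -> dict:
        """Pick a preset based on whether the query needs accuracy or creativity."""
        return RULES_PRESET if query_kind == "rules" else BRAINSTORM_PRESET
    ```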




  • Both can be true.

    It can be true that the FDA was corrupted/captured to some extent and needs more ‘skeptical’, less industry-friendly leadership. At the same time, blanket skepticism of science is not the answer.

    This is my dilemma with MAGA. Many of the issues they tackle are spot on, even if people don’t like to hear that: they’re often right about the problem, even when the proposed solutions are wrong and damaging. I feel this a lot when I hear RFK speak, nodding my head at the first assertion and then grinding my teeth as he goes on.




  • “Local models are not capable of coding yet, despite what benchmarks say. Even if they get what you’re trying to do, they spew out so many syntax errors and tool-calling problems that it’s a complete waste of time.”

    I disagree with this. Qwen Coder 32B and up have been fantastic for niche tasks with the right settings.

    If you apply a grammar template and/or prefill the start of their response, drop the temperature a ton, and keep the actual outputs short, it’s night and day compared to ‘regular’ chatbot usage (there’s a rough sketch of this below).

    TBH one of the biggest problems with LLMs is that they’re treated as chatbot genies wrapped in all sorts of performance-degrading workarounds, rather than as tools for filling in little bits of text (which is what language models were originally conceived for).
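
    For what it’s worth, here’s a rough sketch of the ‘grammar + prefill + low temperature + short output’ recipe against a llama.cpp server. The /completion endpoint, the GBNF grammar field, and the parameter names are as I remember them; treat them as assumptions and check your backend’s docs.

    ```python
    import requests

    LLAMA_SERVER = "http://127.0.0.1:8080"  # assumed local llama.cpp server

    # Tiny GBNF grammar: force the model to emit just a Python identifier
    # (e.g. a suggested function name), nothing else.
    IDENT_GRAMMAR = r'''
    root  ::= ident
    ident ::= [a-z_] [a-z0-9_]*
    '''

    prompt = (
        "Suggest a good name for a Python function that retries an HTTP request "
        "with exponential backoff.\n"
        "Name: "  # 'prefilling' the start of the response pins down the format
    )

    resp = requests.post(
        f"{LLAMA_SERVER}/completion",
        json={
            "prompt": prompt,
            "grammar": IDENT_GRAMMAR,  # constrain output to the grammar
            "temperature": 0.2,        # low temperature for code-ish queries
            "n_predict": 16,           # keep the actual output short
        },
        timeout=120,
    )
    print(resp.json()["content"])  # e.g. "retry_with_backoff"
    ```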