• 0 Posts
  • 32 Comments
Joined 1 year ago
Cake day: March 22nd, 2024



  • They aren’t specialized though!

    There are a lot of directions “AI” could go:

    • Is autoregressive bitnet going to take off? In that case, the compute becomes extremely light, and the thing to optimize for is memory bandwidth and cache.
    • Or diffusion or something with fewer passes like that? In that case, we go the opposite direction, throw bandwidth out the window, and optimize for matmul compute.
    • What if it’s both? In that case, one wants a truckload of ternary adders and not too much else (there’s a sketch below of why ternary weights turn matmuls into pure adds).
    • Or what if some other form of sparsity takes over? Given the effectiveness of quantization and MoE, there’s clearly a ton of sparsity to take advantage of. Nvidia already bet on this (2:4 structured sparsity in their tensor cores), but it hasn’t taken off yet.

    There are all sorts of wild directions the sector could go. The flexibility of a swappable ASIC die would be a huge benefit for AMD: they don’t have to ‘commit’ to any particular direction the way Nvidia’s monolithic dies do. If a new trend takes off, they can take an existing die and swap out the ASIC relatively quickly, without taping out a whole new GPU.
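
    To make the ternary-adder point concrete, here’s a minimal sketch (plain NumPy, purely illustrative) of why a bitnet-style layer with weights restricted to {-1, 0, +1} needs no multipliers: the matmul collapses into additions and subtractions of activations, and the zeros come along as free sparsity.

    ```python
    import numpy as np

    def ternary_matmul(x, W_ternary):
        """Multiply activations x (n,) by a ternary weight matrix W (m, n)
        whose entries are only -1, 0, or +1 -- no multiplications needed."""
        out = np.zeros(W_ternary.shape[0], dtype=x.dtype)
        for i, row in enumerate(W_ternary):
            # Add where the weight is +1, subtract where it is -1,
            # and skip zeros entirely (that's the exploitable sparsity).
            out[i] = x[row == 1].sum() - x[row == -1].sum()
        return out

    # Tiny check: matches the ordinary dense matmul.
    rng = np.random.default_rng(0)
    x = rng.standard_normal(8).astype(np.float32)
    W = rng.integers(-1, 2, size=(4, 8)).astype(np.float32)
    assert np.allclose(ternary_matmul(x, W), W @ x, atol=1e-5)
    ```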




  • Oh, and one more thing: I saw you mention context management.

    Mistral (24B) models are really bad at long context, but that’s not true across the board. I find that Qwen 32B and Gemma 27B are solid at 32K (which is a huge body of text), and with the right backend settings you can easily run either at 64K with very minimal VRAM overhead.

    Specifically, run Gemma with the latest llama.cpp server commit (it will automatically use sliding window attention as of, like, yesterday), or run Qwen (and most other models) with exllamav2 or exllamav3, which quantize the KV cache down to Q4 very efficiently.

    This way you don’t need to manage context: you can feed the LLM the whole adventure so it doesn’t forget anything, and streamed responses will be instant since the prompt is always cached.
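
    As a rough illustration of the ‘just resend the whole adventure and let the cache do the work’ idea, here’s a minimal Python sketch against a locally running llama.cpp server. The /completion endpoint, cache_prompt flag, and port are assumptions from memory; adjust them to whatever your backend actually exposes.

    ```python
    import requests

    LLAMA_SERVER = "http://127.0.0.1:8080"  # assumed local llama.cpp server

    def continue_adventure(full_log: str, player_turn: str) -> str:
        """Send the entire adventure so far plus the new turn.
        With prompt caching, only the new tokens need processing."""
        resp = requests.post(
            f"{LLAMA_SERVER}/completion",
            json={
                "prompt": full_log + "\nPlayer: " + player_turn + "\nGM:",
                "n_predict": 256,
                "cache_prompt": True,  # reuse the KV cache for the shared prefix
                "temperature": 0.7,
            },
            timeout=300,
        )
        resp.raise_for_status()
        return resp.json()["content"]

    # Usage: keep appending to one growing log instead of trimming context.
    log = "You stand at the gates of the ruined keep."
    reply = continue_adventure(log, "I push the gates open.")
    log += "\nPlayer: I push the gates open.\nGM:" + reply
    ```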



  • Another suggestion: be careful with your sampling. Use a low temperature and a high MinP for queries involving rules, and a higher temperature (plus samplers like DRY) when you’re trying to tease out interesting ideas.

    I would even suggest an alternative frontend like mikupad that exposes token probabilities, so you can go to any point in the reply and look through every “idea” the LLM had internally (and regenerate from that point if you wish). It’s also good for debugging sampling issues when you get an incorrect answer, as sometimes the LLM actually had the right token but bad sampling parameters picked a wrong one.
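
    If it helps, here’s roughly what those two ‘modes’ look like as sampler presets. This is a sketch assuming llama.cpp-style parameter names (temperature, min_p, dry_multiplier); other backends spell these differently.

    ```python
    # Two sampler presets: strict for rules lookups, loose for brainstorming.

    RULES_PRESET = {
        "temperature": 0.3,  # low temperature: stick to the most likely tokens
        "min_p": 0.2,        # high MinP: drop everything far below the top choice
    }

    BRAINSTORM_PRESET = {
        "temperature": 1.1,     # higher temperature: let weaker "ideas" through
        "min_p": 0.05,          # looser MinP floor
        "dry_multiplier": 0.8,  # DRY: penalize verbatim repetition
    }

    def sampling_for(query_kind: str) -> dict:
        """Pick a preset based on whether the query needs accuracy or creativity."""
        return RULES_PRESET if query_kind == "rules" else BRAINSTORM_PRESET
    ```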




  • Both can be true.

    It can be true that the FDA was corrupted/captured to some extent and needs more ‘skeptical’, less industry-friendly leadership. At the same time, blanket skepticism of science is not the answer.

    This is my dilemma with MAGA. Many of the issues they tackle are spot on, even if people don’t like to hear that: they’re often right about the problem, even when the proposed solutions are wrong and damaging. I feel this a lot when I hear RFK speak, nodding my head at the first assertion and then grinding my teeth as he goes on.




  • “Local models are not capable of coding yet, despite what benchmarks say. Even if they get what you’re trying to do, they spew out so many syntax errors and tool-calling problems that it’s a complete waste of time.”

    I disagree with this. Qwen Coder 32B and up have been fantastic for niche tasks with the right settings.

    If you apply a grammar template and/or prefill the start of their response, drop the temperature a ton, and keep the actual outputs short, it’s night and day compared to ‘regular’ chatbot usage (there’s a rough sketch of this below).

    TBH one of the biggest problems with LLMs is that they’re treated as chatbot genies wrapped in all sorts of performance-degrading workarounds, rather than as tools for filling in little bits of text (which is what language models were originally conceived for).
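
    For what it’s worth, here’s a rough sketch of the ‘grammar + prefill + low temperature + short output’ recipe against a llama.cpp server. The /completion endpoint, the GBNF grammar field, and the parameter names are as I remember them; treat them as assumptions and check your backend’s docs.

    ```python
    import requests

    LLAMA_SERVER = "http://127.0.0.1:8080"  # assumed local llama.cpp server

    # Tiny GBNF grammar: force the model to emit just a Python identifier
    # (e.g. a suggested function name), nothing else.
    IDENT_GRAMMAR = r'''
    root  ::= ident
    ident ::= [a-z_] [a-z0-9_]*
    '''

    prompt = (
        "Suggest a good name for a Python function that retries an HTTP request "
        "with exponential backoff.\n"
        "Name: "  # 'prefilling' the start of the response pins down the format
    )

    resp = requests.post(
        f"{LLAMA_SERVER}/completion",
        json={
            "prompt": prompt,
            "grammar": IDENT_GRAMMAR,  # constrain output to the grammar
            "temperature": 0.2,        # low temperature for code-ish queries
            "n_predict": 16,           # keep the actual output short
        },
        timeout=120,
    )
    print(resp.json()["content"])  # e.g. "retry_with_backoff"
    ```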