Building SQLite with a small swarm
by kyars
Hope some of you find this post on my experience with parallel coding agents interesting.
If it works, then it's impressive. Does it work? Looking at test.sh, the oracle tests (the ones compared against SQLite) seem to consist, in their entirety, of three trivial SELECT statements. SQLite has tens of thousands of tests; it should be possible to port some of those over to get a better idea of how functional this codebase is.
Edit: I looked over some of the code.
It's not good. It's certainly not anywhere near SQLite's quality, performance, or codebase size. Many elements are the most basic thing that could possibly work, or else missing entirely. To name some examples:
- Absolutely no concurrency.
- The B-tree implementation has a line "// TODO: Free old overflow pages if any."
- When the pager adds a page to the free list, it does a linear search through the entire free list (which can get arbitrarily large) just to make sure the page isn't in the list already.
- "//! The current planner scope is intentionally small: - recognize single-table `WHERE` predicates that can use an index - choose between full table scan and index-driven lookup."
- The pager calls clone() on large buffers, which is needlessly inefficient, kind of a newbie Rust mistake (see the sketch below).
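To illustrate that last point, a minimal hypothetical sketch (the names are invented, not taken from the actual codebase): clone() hands every caller its own copy of the page where a borrow would do.

    use std::collections::HashMap;

    // Hypothetical pager cache: page number -> page bytes.
    type PageCache = HashMap<u32, Vec<u8>>;

    // The newbie pattern: clone() copies the whole page on every read.
    fn read_page_cloning(cache: &PageCache, page_no: u32) -> Option<Vec<u8>> {
        cache.get(&page_no).cloned() // O(page_size) memcpy per access
    }

    // Borrowing avoids the copy entirely; the caller reads in place.
    fn read_page_borrowed(cache: &PageCache, page_no: u32) -> Option<&[u8]> {
        cache.get(&page_no).map(|p| p.as_slice())
    }

(If a page must outlive the borrow, Arc<[u8]> shares ownership without copying the bytes.)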
However…
It does seem like a codebase that would basically work. In the large, it has the necessary components and the architecture isn't insane. I'm sure there are bugs, but I think the AI could iron out the bugs, given some more time spent working on testing. And at that point, I think it could be perfectly suitable as an embedded database for some application, as long as you don't have complex needs.
In practice, there is little reason not to just reach for actual SQLite, which is much more sophisticated. But I can think of one possible reason: SQLite has been known to have memory safety vulnerabilities, whereas this codebase is written in Rust with no unsafe code. It might eat your data, but it won't corrupt memory.
That is impressive enough for now, I think.
> But I can think of one possible reason: SQLite has been known to have memory safety vulnerabilities, whereas this codebase is written in Rust with no unsafe code.
I've lost every single shred of confidence I had in the comment's more optimistic claims the moment I read this.
If you read through SQLite's CVE history, you'll notice most of those are spurious at best.
Some more context here: https://sqlite.org/cves.html
I am using sqlite in my project. It definitely solves problems, but I keep seeing overly arrogant and sometimes even irresponsible statements on their website, and can't really appreciate their attitude towards software engineering. The quote below from this CVE page is one more example of such statements.
> All historical vulnerabilities reported against SQLite require at least one of these preconditions:
> 1. ...
> 2. The attacker can submit a maliciously crafted database file to the application that the application will then open and query.
> Few real-world applications meet either of these preconditions, and hence few real-world applications are vulnerable, even if they use older and unpatched versions of SQLite.
This precondition 2 is literally one of the idiomatic usages of sqlite that they've suggested on their own site: https://sqlite.org/appfileformat.html
SQLite is tested against failure to allocate at every step of its operation: running out of memory never causes it to fail in a serious way, eg data loss. It's far more robust than almost every other library.
That's assuming your malloc function returns NULL when out of memory. Linux systems don't: they return fake addresses that kill your process when you use them.
Lucky that SQLite is also robust against random process death.
Comment was deleted :(
That's not how Linux memory management works, there are no poison values. Allocations are deferred until referenced (by default) and when a deferred allocation fails that's when you get a signal. The system isn't giving you a "fake address" via mmap.
My interpretation of the GP comment is that you are saying the same thing. Linux will return a pointer that is valid for your address space mappings, but might not be safe to actually use, because of VM overcommit. Unixes in general have no way to tell the process how much heap can be safely allocated.
Unfortunately it is not so easy. If rigorous tests at every step were able to guarantee that your program can't be exploited, we wouldn't need languages like Rust at all. But once you have a program in an unsafe language that is sufficiently complex, you will have memory corruption bugs. And once you have memory corruption bugs, you eventually will have code execution exploits. You might have to chain them more than in the good old days, but they will be there. SQLite even had single memory write bugs that allowed code execution which lay in the code for 20 years without anyone spotting them. Who knows how many hackers and three letter agencies had tapped into that by the time it was finally found by benevolent security researchers.
I'm not impressed:
- if you're not passing SQLite's open test suite, you didn't build SQLite
- this is a "draw the rest of the owl" scenario; in order to transform this into something passing the suite, you'd need an expert in writing databases
These projects are misnamed. People didn't build counterstrike, a browser, a C compiler, or SQLite solely with coding agents. You can't use them for that purpose--like, you can't drop this in for maybe any use case of SQLite. They're simulacra (slopulacra?)--their true use is as a prop in a huge grift: tricking people (including, and most especially, the creators) into thinking this will be an economical way to build complex software products in the future.
I'm generally not this pedantic, but yeah, "I wrote an embedded database" is fine to say. If you say "I built SQLite", I expected to at least see how many of the SQLite tests your thing passed.
Also, the very idea is flawed. These are open-source projects and the code is definitely part of the training data.
That's why our startup created the sendfile(2) MCP server. Instead of spending $10,000 vibe-coding a codebase that can pass the SQLite test suite, the sendfile(2) MCP supercharges your LLM by streamlining the pipeline between the training set and the output you want.
Just start the MCP server in the SQLite repo. We have clear SOTA on re-creating existing projects starting from their test suite.
This would be relevant if you could find matching code between this and sqlite. But then that standard would invalidate basically any project as flawed, really - given GitHub, there's barely any idea which doesn't have multiple partial implementations already.
Even if it was copying sqlite code over, wouldn't the ability to automatically rewrite sqlite in Rust be a valuable asset?
Not really because it's not possible for SQLite written in Rust to pass SQLite's checks. See https://www.sqlite.org/whyc.html
That doesn't seem to support your claim; guessing you mean:
> "2. Safe languages insert additional machine branches to do things like verify that array accesses are in-bounds. In correct code, those branches are never taken. That means that the machine code cannot be 100% branch tested, which is an important component of SQLite's quality strategy."
'Safe' languages don't need to do that: if they can verify the array access is always in bounds at compile time, then they don't need to emit any code to check it. That aside, it seems like they are saying:

    for (int i=0; i<10; i++) {
        foo(array[i]);
    }

in C might become the equivalent of:

    for (int i=0; i<10; i++) {
        if (i >= array_lower && i < array_higher) {
            foo(array[i]);
        } else {
            ??? // out of bounds, should never happen
        }
    }

in a 'safe' language, and i will always be inside the array bounds, so there is no way to test the 'else' branch? But that can't be in SQLite's checks as you claim, because the C code does not have a branch there to test?
Either way, it seems hard to argue that a bounds test which can never fail makes the code less reliable and less trustworthy than the same code without a bounds test, using the argument "you can't test the code path where the bounds check, which can never fail, fails" - because you could use the same argument against the C version: "what if the array access code, which is correct, sometimes doesn't run correctly? You can't test for that either."
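For what it's worth, idiomatic code in a safe language often has no extra branch to begin with. A minimal Rust sketch (illustrative only, not from the project):

    // Iterator loop: no bounds-check branch is emitted, because the
    // iterator cannot go out of range by construction.
    fn sum_iter(array: &[i32]) -> i32 {
        array.iter().sum()
    }

    // Indexed loop: the compiler can prove i < array.len(), and will
    // typically elide the check here too.
    fn sum_indexed(array: &[i32]) -> i32 {
        let mut total = 0;
        for i in 0..array.len() {
            total += array[i];
        }
        total
    }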
Correct, that's what I mean. I trust SQLite's devs to know more about this, so I trust what they wrote. There are parts of Rust code that are basically:
do_thing().expect(...);
This branch is required by the code, even if it can't be reached, because the type system requires it. It's not possible to test this branch, therefore 100% coverage is impossible in those cases.

You normally count/test branches at the original language level, not the compiled one. Otherwise we'd get VERY silly results like:
- counting foo().expect() as 2 branches
- counting a simple loop as a missed branch, because it got unrolled and you didn't test it with 7,6,5,4,3,2,1 items
- failing on an unused straightforward implementation of memcpy because your CPU supports SIMD and the compiler chose that alternative
Etc. The compiled version will be full of code you'll never run regardless of language.
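To make the foo().expect() counting point concrete, a toy sketch (do_thing is a placeholder from this thread, not a real API):

    // Toy stand-in: a fallible function that in practice never fails.
    fn do_thing() -> Result<u64, String> {
        Ok(42)
    }

    fn main() {
        // One statement at the source level, but it compiles to a branch:
        // Ok(v) => v, Err(e) => panic!(...). If do_thing() can never fail,
        // the panic arm is unreachable, so machine-level branch coverage
        // can never reach 100%.
        let v = do_thing().expect("do_thing failed");
        println!("{v}");
    }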
The type system does not require that. You can just discard the result:
let _ = do_thing();

sorry for misleading, added an update stating that this is a simulacrum of sqlite
Well--given a full copy of the SQLite test suite, I'm pretty sure it'd get there eventually. I agree that most of these show-off projects are just prop pieces, but that's kind of the point: demonstrating it's technically possible to do the thing, not actually doing the thing, because that'd have diminishing returns for the demonstration. Still, the idea of setting a swarm of agents on a task and, given a suitable test suite, having them build a compliant implementation, is sound in itself.
Sure, but that presumes that you have that test suite written without having a single line of application code written (which, to me, is counterintuitive, unrealistic, and completely insane)
SQLite apparently has 2 million tests! [0] If you started only with that and set your agentic swarm against it, and the stars aligned and you ended up with a pristine, clean-room replica that passes everything, then, other than proof that it could be done, what did you achieve? You stood on the shoulders of giants to build a Bizarro World giant that gets you exactly back to where you began?
I'd be more interested in forking SQLite as-is, setting a swarm of agents against it with the looping task to create novel things on top of what already exists, and see what comes out.
[0] https://en.wikipedia.org/wiki/SQLite#Development_and_distrib...
You think an implementation of SQLite in another language, with more memory safety, has no value?
I agree that this current implementation is not very useful. I would not trust it where I trust SQLite.
Regardless, the potential for having agents build clean room implementations of existing systems from existing tests has value.
> I'm pretty sure it'd get there eventually.
Why? The combinatorics of “just try things until you get it right” makes this impractical.
If you minimax for passing the SQLite test suite, I’m still not sure you’ll have a viable implementation. You can’t prove soundness of code through a test suite alone.
agreed!
> tricking people (including, and most especially, the creators),
I believe it's an ad. Everything about it is trying so hard to seem legit and it's the most pointless thing I have ever seen.
IIRC the official test-suite is not open-source, so I'm not sure how possible this is.
You do not recall correctly. There is more than 500K SLOC of test code in the public source tree. If you "make releasetest" from the public source tarball on Linux, it runs more than 15 million test cases.
It is true that the half-million lines of test code found in the public source tree are not the entirety of the SQLite test suite. There are other parts that are not open-source. But the part that is public is a big chunk of the total.
Out of curiosity, why aren't all tests open source?
One set of proprietary tests is used in their specialist testing service, which is a paid-for service.
What is that service used for besides SQLite?
It's still SQLite, they just need to make money: https://sqlite.org/prosupport.html
Edit: also this:
> TH3 Testing Support. The TH3 test harness is an aviation-grade test suite for SQLite. SQLite developers can run TH3 on specialized hardware and/or using specialized compile-time options, according to customer specification, either remotely or on customer premises. Pricing for this services is on a case-by-case basis depending on requirements.
That's interesting. Here is more information https://sqlite.org/th3.html
The roots of SQLite are in defence industry projects for the US Navy and General Dynamics. It seems like TH3 might be of interest to these sorts of users.
One could assume it's also used for Fossil.
> I think the AI could iron out the bugs, given some more time spent working on testing
I would need to see evidence of that. In my experience it's really difficult to get AI to fix one bug without having it introduce others.
Have it maintain and run a test suite.
Why do people fall for this? We're compressing knowledge, including the source code of SQLite, into storage, then retrieving it and shifting it along latents at tremendous cost in a while loop, basically brute-forcing a franken-version of the original.
Because virtually all software is not novel. For each single partially novel thing, there are tens of thousands of crud apps with just slightly different flow and data. This is what almost every employed programmer does right now - match the previous patterns and produce a solution that's closer to the company requirements. And if we can brute force that quickly, that's beneficial for many people.
> Because virtually all software is not novel.
That isn't true, not by a long shot. Improvements happen because someone is inspired to do something differently.
How will that ever happen if we're obsessed with proving we can reimplement shit that's already great?
At the code level it's still rehashing the same ideas over and over again. I wrote lots of things, from software 3D on a weird system, to JITs, to websites, to telephony software, to compilers, to firmware for hardware, to cloud orchestration, and many other things, and none of it was novel - someone wrote every single pattern in them before, even if nobody put them together the same way. Putting known pieces together is not novel. And as a proportion, almost all software produced is just business apps of various types, with absolutely nothing novel in them.
Also from actual researchers, I know just one person who did something actually novel and it was with queuing.
> At the code level it's still rehashing the same ideas over and over again.
I agree that rehashing the same ideas over and over again is sufficient - for some strange, complacent definition of the word. It's not the only way to think about the discipline, and thank goodness enough smart people realize that.
> Also from actual researchers, I know just one person who did something actually novel and it was with queuing.
Think how many people have to be trying at any given time for it to happen at all.
Months (years?) of publicity from AI companies telling us that the AI is nearing AGI and will replace programmers. Some people are excited about that future and want it now.
In reality, LLMs can (currently) build worse versions of things that already exist: a worse database than SQLite, a worse C compiler than GCC, a worse website than one done by a human. I'd really like to see some agent create a better version of something that already exists, or, at least, something relatively novel.
>a worse database than SQLite, a worse C compiler than GCC, a worse website than one done by a human.
But it enables people who can't do these things at all to appear to be able to do these things and claim reputation and acclaim that they don't deserve for skills they don't have.
I agree.
While I'm generally sympathetic to the idea that human and LLM creativity are broadly similar (combining ideas absorbed elsewhere in new ways), when we ask for something that already exists it's basically just laundering open source code.
License laundering and the ability to not credit or pay the original developers.
Laundering public domain code no less
copyright laundering machine, which could poison the very notion of ip / copyright, either open or closed source. the only code that can't be laundered becomes code hidden behind a server api
"Implements + tests against sqlite3 as oracle"
That's the real unlock in my opinion. It's effectively an automated reverse engineering of how SQLite behaves, which is something agents are really good at.
I did a similar but smaller project a couple of weeks ago to build a Python library that could parse a SQLite SELECT query into an AST - same trick, I ran the SQLite C code as an oracle for how those ASTs should work: https://github.com/simonw/sqlite-ast
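For anyone curious what the oracle trick looks like mechanically, here's a minimal Rust sketch; it assumes the rusqlite crate for the real-SQLite side, and run_my_engine is a hypothetical entry point for the implementation under test:

    // Differential "oracle" testing: run the same SQL against real SQLite
    // (via rusqlite) and against the reimplementation, then compare.
    use rusqlite::Connection;

    fn oracle_query(setup: &[&str], query: &str) -> rusqlite::Result<i64> {
        let conn = Connection::open_in_memory()?;
        for stmt in setup {
            conn.execute(stmt, [])?;
        }
        conn.query_row(query, [], |row| row.get(0))
    }

    fn main() -> rusqlite::Result<()> {
        let setup = [
            "CREATE TABLE t (x INTEGER)",
            "INSERT INTO t VALUES (1), (2), (3)",
        ];
        let query = "SELECT sum(x) FROM t";
        let expected = oracle_query(&setup, query)?;
        // let actual = run_my_engine(&setup, query); // hypothetical
        // assert_eq!(actual, expected);
        println!("oracle says: {expected}");
        Ok(())
    }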
Question: you mention the OpenAI and Anthropic Pro plans, was the total cost of this project in the order of $40 ($20 for OpenAI and $20 for Anthropic)? What did you pay for Gemini?
yes, in the order of $50 let's say, although with the API I believe it would be in the hundreds
Gemini is free, I don't even know if they have a paid plan?
> 84 / 154 commits (54.5%) were lock/claim/stale-lock/release coordination.
Parallelism over one code base is clearly not very useful.
I don't understand why going as fast as possible is the goal. We should be trying to be as correct as possible. The whole point is that these agents can run while we sleep. Convergence is nonlinear. You want every step to be in the right direction. Think of it more as a series of crystalline database transactions that must unroll in perfect order than as a big pile of rocks that needs to be moved from a to b.
Orchestration and autonomy are the things people get hyped about, but validation is the real bottleneck, and I'm pretty sure it's not amenable to complete automation. The people pushing orchestration the hardest are trying to get their users to validate for them, which taints the AI related open source ecosystem for everyone (sorry Steve/Peter!).
I wrote a rant about this a while back to try and encourage people to be more responsible: https://sibylline.dev/articles/2026-01-27-stop-orchestrating...
I think we can now begin to experimentally test Conway's law and corollaries.
Agreed, a flat set of workers configured like this is probably not the best configuration.
Can you imagine what an all human team configured like this would produce?
What's the point of building something that already exists in open source. It's just going to use code that already exists. There's probably dozens of examples written by humans that it can pull from.
What do you suggest we build instead, that hasn't already been done? I've been developing for decades, and I can't think of a single thing that hasn't already been at least kind of done, either in the same or another language.
I want a language with:
- the memory, thread safety, and build system of Rust
- the elegant syntax of OCaml and Haskell
- the expressive type system of Haskell and TypeScript
- the directness and simplicity of JavaScript
Think coding agents can help here?
You have conflicting requirements there - expressive type systems are not direct and simple. And elegant is subjective.
But seriously though: have you tried to see how far you can get with the design right now? You can start iterating on it already, even if the implementation will lag.
I do not have conflicting requirements. Expressive type systems ARE direct and simple.
Expressive power is the ratio of how strongly/clearly you can encode invariants to how complex and ceremonious the syntax needs to be.
See how JS, a language usually seen as middling/mediocre, can distill the basic good parts of OOP into very direct and clear idioms? I can just create an object literal and embed simple methods on it that receive the "this" pointer and use it. The constructor would be just a regular function. None of the cruft of standard OOP.
See how you define an enumerable union in TypeScript? Very simple. And yet I can think of many major languages that do not have this, certainly not without a lot of ceremony and complexity.
And I can go on.
> the expressive type system of Haskell
> Expressive type systems ARE direct and simple.
Those result in a conflict because given expressive types, people will make them not simple. For example, you realise why Haskell continuously gets academic papers like "Functor is to Lens as Applicative is to Biplate; Introducing Multiplate"? There's no reason for something like that in Go for example, because it's less expensive and the signatures are trivial.
> JS (...) can distill the basic good parts of OOP into very direct and clear idioms?
Clear only in the context of that one specific project, which you need to know. Perl suffers from this. You can build your own OOP, so many people did and now there are hundreds of mostly-compatible-but-not-fully versions of OOP with different syntax and semantics all over the place.
> There's no reason for something like that in Go for example, because it's less expensive [sic] [1] and the signatures are trivial.
Whoah, Are you saying, for example, that generics are not useful? That's quite the claim. In that case, to make things even simpler, let's remove recursion, functions as data, even type systems altogether, because they lead to "complex" code. See where your reasoning leads?
Any language feature can be abused. In the same vein, you also say:
> Perl suffers from this. You can build your own OOP, so many people did and now there are hundreds of mostly-compatible-but-not-fully versions of OOP with different syntax and semantics all over the place.
One can create OOP in most modern Turing-complete languages, so this is not a strong argument.
[1] I believe you meant "expressive" here
> Are you saying, for example, that generics are not useful? That's quite the claim.
That's why I didn't make that claim. If you just want to invent straw men, I'm out.
All of those things have been built before; you're even referencing existing languages that have those "features". Parent seemingly was asking for people to build something completely novel, that doesn't have any FOSS code available that's done that thing before.
And yes, LLMs/agents can help you do it for sure. I'm currently building the lisp of my dreams in my free time, and I already have a compiler, interpreter, UI framework, and some other things done in a way I'm happy with.
Yeah, the "novel" bit is about integrating all those aspects into one language.
And trust me, such a language that captures enough mindshare is absolutely needed. People thought Rust was going to be it, but it got taken over by the idea of it being the next C++.
If LLMs are what you make them out to be, it shouldn't have taken long before we saw serious attempts at such languages, but I suspect LLMs are of barely any help here beyond some basic implementation tasks.
> Yeah, the "novel" bit is about integrating all those aspects into one language.
But do you think GP, who I initially wrote that comment to, would agree with that? All those features and integrations have examples in the FOSS world already, wouldn't the LLM just use what it learned from that?
> but I suspect LLMs are of barely any help here beyond some basic implementation tasks.
Disagree, as mentioned I've already managed to get together my own language that works for my purposes, and they did more than just "basic implementation tasks" although of course I've been reviewing and deciding stuff, no vibe coding here.
I tried that over a month,
except for the '- the directness and simplicity of JavaScript' part:
https://github.com/artpar/guage
But somehow the language feels so foreign. it can obviously do hello world, but I don't have a real use case
PS: the "Pure symbols only" is no longer true, most symbols have been converted to English names
and, the "days" you see there in the markdowns are "claude code sessions", not actual days
Nim comes close to what you want.
Looking a bit further out, F# and Swift also come close.
Great work! Obviously the goal of this is not to replace sqlite, but to show that agents can do this today. That said, I'm a lot more curious about the Harness part (Bootstrap_Prompt, Agent_Prompt, etc.) than in what the agents have accomplished. E.g., how can I repeat this myself? I couldn't find that in the repo...
hello, thanks! all of the harnessing is in this repo: https://github.com/kiankyars/parallel-ralph/
This blog post doesn't say anything about your experience.
How well does the resulting code perform? What are the trade-offs/limitations/benefits compared to SQLite? What problems does it solve?
Why did you use this process? this mixture of models? Why is this a good setup?
the code has not been rigorously tested, in all honesty (this is mainly an experiment in agent orchestration, as opposed to building a viable sqlite in rust)
- The choice of two workers per model is purely pragmatic: I can't afford more.
- I chose heterogeneous agents because it has not been done yet. There is no performance justification for this choice.
Interesting to compare this to the in-progress project https://github.com/Dicklesworthstone/frankensqlite
Which aims to match SQLite quality and provide new features (free encryption, multiple simultaneous writers, and bitflip resistance.)
That project is definitely of higher quality than this one. For instance, this project does not have concurrency.
If anyone is looking for ideas for these projects - it’d be great to be able to run macos applications on linux…
Someone could have a swarm of agents build “wine for macos apps”.
I can't quite tell if the tests that passed were SQLite's own famously thorough test suite, or your own.
If it's SQLite's suite then it's great the models managed to get there, but one issue (without trying to be too pessimistic) is that the models had the test suite there to validate against. SQLite's devs famously spend more of their time making the tests than building the functionality. If we can get AI that reliably defines the functionality of such programs by building the test suite over years of trial and error, then we'll have what people are saying.
Sorry for the ambiguity, it's not the test suite and I have updated the blog post to make that clear. I agree that building software where you do have this oracle is much easier than not having it and expecting the AI to build it.
Take a look at SQLite’s test coverage. It’s impressive: https://sqlite.org/testing.html
590x the application code
this project certainly fails against the official sqlite test suite, so I did not bother testing
The fact that AI agents can even build something that purports to be a working database is also impressive.
A small, highly experienced team steering Claude might be able to replicate the architecture and test suite reasonably quickly.
1-shotting something that looks this good means that with a few helping hands, small teams can likely accomplish decades of work in mere months.
Small teams of senior engineers can probably begin to replicate entire companies worth of product surface area.
Apologies for the snark, but are you also impressed by `git clone` downloading a repository that is openly available on the internet?
It can even do that in a lossless way, instead of burning a bunch of tokens to get a bad, barely working half-copy.
Don't get me wrong, I'm no AI hater, they are an impressive technology. But both AI-deniers and hypers need a reality check.
I think we are moving into a regime of really fast software iteration, but there could also be a slowing down of progress. Only time will tell
The other day I asked AI to one-shot an implementation of hyperbolic trig functions for double-double floats.
I provided a repo (mine) that already implemented double-double arithmetic, trigonometry, and logarithms/exponentials, with plenty of tests.
It produced something that looked this good. It had tests, it followed the style of the existing code base, etc. But it was full of shit and outright lies.
After I reviewed it to fix deficiencies, I don't think there was anything left of the original.
I had much more success the previous week using an AI to rubber duck the algorithms to implement trig.
I am incredibly sceptical that just adding more loops — and less critical thinking/review — to brute force through a solution is a good idea.
I push back on loops being insufficient, because algorithms such as AlphaEvolve have already proved very effective.
I believe so
Will this work be of any use tho?
not initially, but it's the genesis
Would be better to choose a small subset of functionality and get that working as well as sqlite (or better), then iterate that way. Context size is too small to work on such a large system.
I agree with you that choosing a subset and iterating to get an optimized version as good as SQLite or better is a better way to test and achieve more useful results. But with respect to the context size, Cursor has made agentic projects with over one million lines of code, so I would push back on that: https://cursor.com/blog/scaling-agents
Why would you need 6 different models running across three providers? Just have a single one running, then you avoid all this nonsense around locking.
And this is ultimately pointless, because it's just a shittier SQLite. It's nothing new. If you're going to build something big like this, there needs to be a real business case.
You could already slop out a replica of SQLite if you wanted. But you don’t, because of the effort it would take to test and maintain it.
Ultimately, this was an experiment with no intent to migrate to a production environment. Regarding single agents, for large projects one agent is too slow. So that's why developing multi-agent paradigms is compelling.
I view SQLite as just an objective to attain and optimize for, but nothing more. I agree 100% that this is just a shittier SQLite.
Not a single comment about whether it actually works or not?
Sorry full transparency, I put my confidence in the fact that the model said it was passing all tests and had implemented most SQLite operations, but that was a mistake, so now I'm independently running tests.
It largely doesn't. The authors didn't attempt to run against SQLite's open test suite.
You are right, I'm rectifying that now
I'm a heavy Cursor user (not yet on Claude) and I see a big disconnect between my own experience and posts like this.
* After a long vibe-coding session, I have to spend an inordinate amount of time cleaning up what Cursor generated. Any given page of code will be just fine on its own, but the overall design (unless I'm extremely specific in what I tell Cursor to do) will invariably be a mess of scattered control, grafted-on logic, and just overall poor design. This is despite me using Plan mode extensively, and instructing it to not create duplicate code, etc.
* I keep seeing metrics of 10s and 100s of thousands of LOC (sometimes even millions), without the authors ever recognizing that a gigantic LOC count is probably indicative of terrible, heisenbuggy code. I'd find it much more convincing if this post said it generated a 3K-line SQLite implementation, and not 19K.
Wondering if I'm just lagging in my prompting skills or what. To be clear, I'm very bullish on AI coding, but I do feel people are getting just a bit ahead of themselves in how they report success.
I don't think I mentioned that, yeah, the code quality is suboptimal and this is purely a proof of concept, so I'm going to update the blog post with that information. But I completely agree with you that the code you get from models does not follow best practices, and this is even more the case when you have many agents on one project generating lots of redundancy (which I do cover in the blog post).
This has been my experience also, but I've been using everything (Claude Code, opencode, Copilot, etc.). It's impressive when I ask it to do something I don't know how to do, like some python apps, but when it's in my stack I have to constantly stop it mid-processing and ask it to fix something. I'm still validating the plan and rewriting a lot of the code, because the quality just is not there yet.
And for the most part I use either Opus or Sonnet, but for planning I sometimes switch to ChatGPT, since I think Claude is too blunt and does not ask enough questions. I also have local setups with Ollama and have tried some Kimi models for personal projects. The results are the same for all, but again the Claude models are slightly better.
> cleaning up what Cursor generated
What model? Cursor doesn't generate anything itself, and there's a huge difference between gpt5.3-codex and composer 1 for example.
Well, I've got it as Auto (configured by my company and I forget to change it). The list of enabled models includes claude-4.6-opus-high, claude-4.5-sonnet, gpt-5.3-codex, and a couple more.
That is probably Composer-1, which is their in-house model (insofar as a fine-tune of an open-weights model can be called in-house). It's competent at grunt work, but it doesn't compare to the best of Claude and Codex; give those a shot sometime.
this is the business model bet. the codebase is a big ball of mud that only a superhuman ai can comprehend, therefore everyone must use superhuman ai to make changes in the codebase. the selling point is iteration speed, especially early iteration speed
cf. SV conventional wisdom: he who ships first wins the market
in fairness, there is real value in iteration speed. i'm not holding my breath on human comprehensible corporate code bases moving forward. a slew of critical foundational projects, mostly run by the big names, may still care about what used to be called "good engineering practices".
Did they pass all unit tests in the end?
not the sqlite test suite unit tests; this project certainly fails against the official sqlite test suite, so I did not bother testing
It doesn’t matter, just jump on the hype train!
or jump off, and instead grab onto the (well-deserved) sqlite-test-suite hype train.
(I'm being sarcastic.)
Who cares?
I hope someone
Why do you think that it is a good idea to make it public? It is obviously a half-hallucinated, mostly broken, unusable piece of low effort (on the human part), with as much value as a blurry image generated with stable diffusion, which people now widely consider bad taste and slop.
I hope I did not give the impression that I wanted people to actually use this. I'm just using this as a test bench similar to how Anthropic made a C compiler with Claude, which of course they do not recommend you use.
> There isn’t a great way to record token usage since each platform uses a different format, so I don’t have a grasp on which agent pulled the most weight
lol
Claude code token tracking doesn't even work, for example. And Gemini also doesn't provide statistics, so I'm just being honest here.
Comment was deleted :(
[dead]