Last week I came across the article "Why LLMs Can't Really Build Software" (link: https://zed.dev/blog/why-llms-cant-build-software). I had just been discussing with friends a question that concerns those of us in this field: how much of software development work can LLMs actually complete, is that work genuinely deliverable, and where do things go from here. This article makes several points I find convincing.
The article argues that software engineers must constantly switch back and forth between the requirements and the code's actual behavior, iterating as they go, and that this is something LLMs struggle to do. The author frames it in terms of building and maintaining mental models. It is precisely because LLMs have a fundamental gap in this core ability that they can only serve as assistive tools, not as independent software engineers.
Below is the original text; my comments follow at the end:
One of the things I have spent a lot of time doing is interviewing software engineers. This is obviously a hard task, and I don’t claim to have a magic solution; but it’s given me some time to reflect on what effective software engineers actually do.
The Software Engineering Loop
When you watch someone who knows what they are doing, you'll see them looping over the following steps:
- Build a mental model of the requirements
- Write code that (hopefully?!) does that
- Build a mental model of what the code actually does
- Identify the differences, and update the code (or the requirements)

There are lots of different ways to do these things, but the distinguishing factor of effective engineers is their ability to build and maintain clear mental models.
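(A note from me: to make the loop concrete, here is a toy, runnable Python sketch of those four steps. The absolute-value requirement and every name in it are invented purely for illustration; none of it comes from the article.)

```python
# A toy illustration of the loop above. The "mental models" are reduced to
# a dict of (input -> expected output) pairs; everything here is invented.

def diff(requirements, code):
    """Compare the model of the requirements against the code's observed behavior."""
    return [(x, want, code(x)) for x, want in requirements.items() if code(x) != want]

# Step 1: build a mental model of the requirements (here: absolute value).
requirements = {-2: 2, 0: 0, 3: 3}

# Step 2: write code that (hopefully?!) does that.
code = lambda x: x  # first attempt forgets the negative case

# Steps 3-4: observe what the code actually does, identify the differences,
# and update the code (or, in real life, sometimes the requirements).
while (differences := diff(requirements, code)):
    print("mismatch (input, wanted, got):", differences)  # [(-2, 2, -2)]
    code = lambda x: x if x >= 0 else -x                  # fix guided by the diff

print("the code now matches the mental model of the requirements")
```

The point of the toy is the judgment call inside the loop: deciding whether a mismatch means the code is wrong or the model of the requirements is wrong is exactly where, per the article, LLMs fall down.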
How about LLMs?
To be fair, LLMs are quite good at writing code. They’re also reasonably good at updating code when you identify the problem to fix. They can also do all the things that real software engineers do: read the code, write and run tests, add logging, and (presumably) use a debugger.
But what they cannot do is maintain clear mental models.
LLMs get endlessly confused: they assume the code they wrote actually works; when tests fail, they are left guessing as to whether to fix the code or the tests; and when it gets frustrating, they just delete the whole lot and start over.
This is exactly the opposite of what I am looking for.
Software engineers test their work as they go. When tests fail, they can check in with their mental model to decide whether to fix the code or the tests, or just to gather more data before making a decision. When they get frustrated, they can reach for help by talking things through. And although sometimes they do delete it all and start over, they do so with a clearer understanding of the problem.
But soon, right?
Will this change as models become more capable? Perhaps?? But I think it’s going to require a change in how models are built and optimized. Software engineering requires models that can do more than just generate code.
When a person runs into a problem, they are able to temporarily stash the full context, focus on resolving the issue, and then pop their mental stack to get back to the problem in hand. They are also able to zoom out and focus on the big picture, allowing the details to temporarily disappear, diving into small pieces as necessary. We don’t just keep adding more words to our context window, because it would drive us mad.
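(Another note from me: the stash-and-pop behavior described here is, literally, a stack. Below is a toy sketch of context managed that way; it is purely illustrative and is not how any current model handles its context window.)

```python
# A toy "mental stack": push the current context aside, focus on a
# sub-problem, then pop back. Invented for illustration only.

class MentalStack:
    def __init__(self):
        self.stack = []    # stashed outer contexts
        self.focus = None  # the single context currently in view

    def push(self, new_focus):
        """Stash the current context and zoom into a sub-problem."""
        if self.focus is not None:
            self.stack.append(self.focus)
        self.focus = new_focus

    def pop(self):
        """Drop the finished sub-problem and restore the outer context."""
        self.focus = self.stack.pop() if self.stack else None
        return self.focus

mind = MentalStack()
mind.push("ship the password-reset feature")  # the big picture
mind.push("why is this one test failing?")    # zoom in; big picture stashed
mind.pop()                                    # sub-problem done, details gone
print(mind.focus)                             # -> ship the password-reset feature
```

Contrast this with a context window, which only ever grows: nothing is stashed, and nothing is popped.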
Even if it wasn’t just too much context to deal with, we know that current generative models suffer from several issues that directly impact their ability to maintain clear mental models: Context omission: Models are bad at finding omitted context.
Recency bias: They suffer a strong recency bias in the context window.
Hallucination: They commonly hallucinate details that should not be there.
These are hopefully not insurmountable problems, and work is being done on adding memory to let them perform mental tricks similar to ours. Unfortunately, for now, they cannot (beyond a certain complexity) actually understand what is going on. They cannot build software because they cannot maintain two similar "mental models", identify the differences, and figure out whether to update the code or the requirements.
So, what now?
Clearly LLMs are useful to software engineers. They can quickly generate code, and they are excellent at synthesizing requirements and documentation. For some tasks this is enough: the requirements are clear enough, and the problems are simple enough, that they can one-shot the whole thing.
That said, for anything non-trivial, they are not capable of maintaining enough context accurately enough to iterate to a working solution. You, the software engineer, are responsible for ensuring that the requirements are clear, and that the code actually does what it purports to do.
At Zed we believe in a world where people and agents can collaborate to build software. But we firmly believe that (at least for now) you are in the driver's seat, and the LLM is just another tool to reach for.
My comments
This article is a snapshot of what LLMs can do right now, and the AI field moves extremely fast. The problems the author raises, such as memory and context management, are exactly what academia and industry are racing to solve: RAG (retrieval-augmented generation), long-context techniques, and more sophisticated agent architectures all exist to mitigate them. Whether the article's conclusion still holds a year, or even a few months, from now remains to be seen.
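To make "mitigate" concrete, here is a minimal toy sketch of the RAG idea: rather than stuffing everything into the context window, retrieve only the chunks relevant to the current question. Word overlap stands in for a real embedding model, and all the data and names are invented:

```python
# A minimal RAG sketch: instead of pushing an entire codebase or history
# into the context window, retrieve only chunks relevant to the question.
# Word overlap stands in for a real embedding model; all data is invented.

def score(query, chunk):
    """Toy relevance: shared words longer than 3 chars (real systems embed both)."""
    words = lambda s: {w for w in s.lower().split() if len(w) > 3}
    return len(words(query) & words(chunk))

def retrieve(query, chunks, k=2):
    """Keep the top-k chunks that have any relevance at all."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return [c for c in ranked[:k] if score(query, c) > 0]

knowledge = [
    "the payment service retries failed charges three times",
    "auth tokens expire after 15 minutes",
    "the billing cron job runs at midnight UTC",
]

query = "why did the failed charge get retried"
context = retrieve(query, knowledge)
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query
print(prompt)  # only the retry fact reaches the model, not the whole list
```

Note that this does not give the model a mental model; it only narrows what the model has to attend to, which is why I call it mitigation rather than a fix.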
The article mainly criticizes a single LLM working alone. What we currently call Vibe Coding should arguably evolve next into Vibe Working, and the AI software engineer of the future is more likely to be a system composed of multiple specialized AI agents. For example:
- A planning agent that understands the requirements and decomposes them into tasks.
- A coding agent that writes the code.
- A testing agent that writes and runs tests.
- A debugging agent that analyzes the logs and code after a test failure to locate the fault.
- A reflection agent that re-evaluates the plan after repeated failures.

Taken as a whole, such a system can approximate the "software engineering loop" the article describes, even if no single agent holds a perfect "mental model" (see the sketch below).
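Here is a toy skeleton of how those roles might be wired together. The call_llm function and the role prompts are placeholders I am inventing; this is not any real framework or API:

```python
# A toy skeleton of the multi-agent loop above. call_llm and the role
# prompts are invented placeholders, not a real framework or API.

def call_llm(role, prompt):
    """Stand-in for one specialized agent; a real system would call a model."""
    return f"[{role}] {prompt[:48]}..."

def build(requirement, max_rounds=3):
    plan = call_llm("planner", f"understand and decompose: {requirement}")
    for _ in range(max_rounds):
        code = call_llm("coder", f"implement: {plan}")
        report = call_llm("tester", f"write and run tests for: {code}")
        if "FAIL" not in report:  # toy success check
            return code
        fault = call_llm("debugger", f"analyze logs and locate fault: {report}")
        plan = call_llm("reflector", f"re-evaluate the plan given: {fault}")
    return None  # after repeated failures, escalate to a human

print(build("users can reset their password by email"))
```

Whether the loop converges still depends on the quality of each agent's judgment; the structure simulates the iteration, it does not by itself supply the mental models.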
Conflict of interest: the author is from Zed (itself an IDE), whose product philosophy is "humans and agents collaborating". Emphasizing that "the human is in the driver's seat" therefore fits the company's product positioning and commercial interests. That does not make the argument wrong, but it is worth being aware of the stance behind it.
To sum up: AI (LLMs) can help you. How much it helps, and whether you or the AI leads the work, is a question that concerns not just this author but all of us in the field, and one we need to assess objectively. Right now it can write your code and string together some workflow steps; whether it eventually takes over the work as a whole may just be a matter of time, and research in that direction is very active.