Actually, MCP wastes a lot of tokens when compared to regular tool calling. You might not notice it on more trendy models with large contexts, but for those of us trying to use locked down/local/cheap models it makes very little sense.
Also, MCP creates a new problem: providing the model with too much context when trying to combine tools across multiple servers. It works OK with small, very focused servers (like helpers for a specific data set), but if you try to mix and match servers things get out of hand really quickly and the entire workflow becomes very unreliable—too many options to digest and pursue, just like humans.
Is that just bad implementation? Where are the wasted tokens?
I noticed your second issue, but to me it's just from bad implementation. For some reason people keep exposing generic overlapping tools from multiple MCP servers.
I don't know that MCP itself causes this issue; any vendor offering a "tools API" would bloat things up if they shoved too many APIs into it.
Here's what Anthropic has to say about it:
As MCP usage scales, there are two common patterns that can increase agent cost and latency:
Tool definitions overload the context window;
Intermediate tool results consume additional tokens.
[...]
Tool descriptions occupy more context window space, increasing response time and costs. In cases where agents are connected to thousands of tools, they’ll need to process hundreds of thousands of tokens before reading a request.
[...]
Most MCP clients allow models to directly call MCP tools. For example, you might ask your agent: "Download my meeting transcript from Google Drive and attach it to the Salesforce lead."
The model will make calls like:
TOOL CALL: gdrive.getDocument(documentId: "abc123")
→ returns "Discussed Q4 goals...\n[full transcript text]"
(loaded into model context)
TOOL CALL: salesforce.updateRecord(
  objectType: "SalesMeeting",
  recordId: "00Q5f000001abcXYZ",
  data: { "Notes": "Discussed Q4 goals...\n[full transcript text written out]" }
)
(model needs to write entire transcript into context again)
Every intermediate result must pass through the model. In this example, the full call transcript flows through twice. For a 2-hour sales meeting, that could mean processing an additional 50,000 tokens. Even larger documents may exceed context window limits, breaking the workflow.
With large documents or complex data structures, models may be more likely to make mistakes when copying data between tool calls.
Now, if you instead have the LLM write code, that code can perform whatever filtering/aggregation/transformation it needs without round-tripping between the LLM and the tool(s), and the only tokens consumed are those of the final result. What happens with MCP? The full text of each MCP call floods into the context, and the LLM then has to make sense of what it just read: either it regurgitates it into a file for post-processing (very likely with differences/"hallucinations" slipped in), or, in the usual case (I'm personifying the LLM here for rhetorical purposes), it simply tries to reason about what it read to give you the filtered/aggregated/transformed result you're looking for -- again, very likely with mistakes made.
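To make the contrast concrete, here's a rough sketch of what the code-execution version of Anthropic's own example could look like. This is my own illustration, not an actual MCP SDK API: the wrapper functions getDocument/updateRecord and their signatures are assumptions standing in for whatever bindings the agent's environment exposes.

    // Hypothetical wrappers around the two MCP tools; only their assumed
    // shapes are declared here so the sketch type-checks on its own.
    declare function getDocument(args: { documentId: string }): Promise<string>;
    declare function updateRecord(args: {
      objectType: string;
      recordId: string;
      data: Record<string, string>;
    }): Promise<void>;

    async function attachTranscriptToLead(): Promise<string> {
      // The full transcript lives only in this variable, inside the
      // execution sandbox; it is never pasted into the model's context.
      const transcript = await getDocument({ documentId: "abc123" });

      await updateRecord({
        objectType: "SalesMeeting",
        recordId: "00Q5f000001abcXYZ",
        data: { Notes: transcript },
      });

      // Only this short confirmation is returned to the model, instead of
      // the transcript flowing through the context twice.
      return `Attached transcript (${transcript.length} chars) to the Salesforce lead.`;
    }

The model only ever sees the script it wrote plus the one-line return value; the ~50,000 tokens of transcript stay inside the sandbox, and there's nothing for it to miscopy between tool calls.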