Something else to note here is that the Python interpreter is slow compared to native code, so if you're parsing a lot of small JSON files, whatever difference in speed the JSON parser makes is masked somewhat by the interpreter overhead (assuming you're doing something with that JSON you just parsed). When you're using a JIT like PyPy, the overhead of the rest of your code shrinks, and the bottleneck can shift to something like JSON parsing, I/O, etc.
This is the main bottleneck in pysimdjson[1]. 95-99%[2] of the time is spent creating Python objects rather than parsing the documents.
There was no trick I could find to reduce the cost of bringing the document into Python land (although treating all strings as bytes instead of decoding them with PyUnicode_DecodeUTF8/unicode_decode_utf8 was a fairly large gain on string-heavy documents), so I added a quick hack to push the filtering down into C, since most of my use cases only need a small part of the full document.
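Roughly, the idea is that you parse once and only pull the piece you need across into Python, something like the sketch below (method names here are from memory of the proxy API and may differ from the current release, so check the docs before copying):

    import simdjson

    parser = simdjson.Parser()

    # Illustrative payload; imagine "events" holding thousands of entries.
    raw = b'{"user": {"id": 42, "name": "ada"}, "events": [{"t": 1}, {"t": 2}]}'

    # parse() does the parsing natively but hands back a lazy proxy: Python
    # objects are only created for the pieces you actually touch.
    doc = parser.parse(raw)

    # Pull out just the part you need via a JSON Pointer instead of
    # materializing the whole tree (which is what json.loads, or calling
    # doc.as_dict(), would do).
    user_id = doc.at_pointer('/user/id')  # -> 42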
I thought a bit about this - I think you could go a lot faster if you had a 'bulk' method to create lots of Python objects in one hit. I don't know if that's feasible, as I don't know much about the innards of Python, but there's no reason one couldn't keep goofing around with SIMD code and branch-free stuff all the way until you've created the objects.
This may be ferociously unmaintainable (i.e. depend on things that Python implementations don't advertise as a part of the API) but I don't know. I doubt you're the only one to have this problem.
Like you said, it would unfortunately be unmaintainable, requiring a fork of CPython. Aggressive optimizations like this could go into pypy, but IMO CPython is still trying to be a "reasonably readable" reference implementation.
I think the time is currently best spent trying to push more of the work back into native code, extending the simple filtering that's in there now to also handle mutation (similar to jq), reducing the amount of back-and-forth with the interpreter.
This whole discussion makes me think that perhaps there would be a point in parsing JSON into a "JSON database" object instead of a tree of individual hashmaps. In theory it would allow for a more memory-efficient representation and faster search or transformation operations.
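A hypothetical sketch of what I mean (JSONDatabase, get, and the pointer syntax are all made up; a real version would keep the parsed data in a compact native structure rather than a Python dict):

    import json

    # Hypothetical sketch of the idea, not an existing library: parse once,
    # keep the document in one place, and only materialize what a query touches.
    class JSONDatabase:
        def __init__(self, raw: bytes):
            # A real implementation would hand the bytes to native code
            # (e.g. a simdjson DOM) and keep them there; this just fakes
            # the interface on top of json.loads.
            self._doc = json.loads(raw)

        def get(self, pointer: str):
            """Return only the value at a JSON-Pointer-ish path, e.g. '/user/id'."""
            node = self._doc
            for part in pointer.strip('/').split('/'):
                node = node[int(part)] if isinstance(node, list) else node[part]
            return node

    db = JSONDatabase(b'{"user": {"id": 42}, "events": [{"t": 1}]}')
    db.get('/user/id')      # 42
    db.get('/events/0/t')   # 1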
> so if you're parsing a lot of small JSON files, whatever difference in speed the JSON parser makes is masked somewhat by the interpreter overhead
Does this vary based on the size/quantity of the input JSON files? I would expect it to hold regardless of the size/quantity of the input. Further, isn't CPython's json implementation already native code? If so, I would expect the bottleneck to be allocating the output objects, or pressure on the GC, or something like that?
Of course, the smaller the JSON file, the shorter the amount of time spent in native code, and proportionally more time will be spent in the interpreter around whatever looping construct is used to feed data into json.loads (and presumably process that data).
    import json

    # interpreter: time spent here looping
    for file_name in file_names:
        # interpreter: do stuff with the file name, load the file into a string, etc.
        with open(file_name) as f:
            str_data = f.read()
        # interpreter: dispatch the call to a native function
        # native: actually perform json.loads
        data = json.loads(str_data)
        # interpreter: time spent processing the data
        process_data(data)

    # what happens as native time goes to zero (json parse time decreases)?
    # answer: interpreter time dominates
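If you want to see where the split actually falls for your own workload, a crude timer like this does the job (file_names and process_data are stand-ins for whatever you're actually doing):

    import json
    import time

    def split_timing(file_names, process_data):
        # Rough measurement of how much of the loop is actually json.loads
        # versus everything around it.
        parse_time = 0.0
        total_start = time.perf_counter()
        for file_name in file_names:
            with open(file_name) as f:
                str_data = f.read()
            t0 = time.perf_counter()
            data = json.loads(str_data)           # the native "parse" bit
            parse_time += time.perf_counter() - t0
            process_data(data)                    # interpreter-side work
        total_time = time.perf_counter() - total_start
        print(f"json.loads: {parse_time:.3f}s of {total_time:.3f}s "
              f"({100 * parse_time / total_time:.1f}%)")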
So "interpreter time" is defined as "time not spent parsing JSON" and thus "the less time you spend parsing JSON the less time you spend parsing JSON"? Of course that's correct, but it's so obvious as to be unsubstantial. Surely the OP is making a more interesting point...
No, that's it - for most json documents, the time the interpreter spends in all the stuff surrounding (and executing) json.loads dominates the time spent in native code parsing. That's why replacing json.loads with a faster native implementation doesn't result in noticeable improvements - it's speeding up the fastest bit, and if it's already 0.1% of the wall clock time, then you can't get much more than about 0.1% faster.
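It's just Amdahl's law; a quick sanity check (the 0.1% figure is only for illustration):

    def overall_speedup(parse_fraction, parser_speedup):
        # Amdahl's law: overall speedup when only the parsing fraction
        # of the total time gets faster.
        return 1 / ((1 - parse_fraction) + parse_fraction / parser_speedup)

    overall_speedup(0.001, 10)    # ~1.0009x with a 10x faster parser
    overall_speedup(0.001, 1e9)   # ~1.001x even with an (almost) free parser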
PyPy is a JIT-compiled implementation though, so the interpreted bits are much faster. The OP is making the point that Python 3/CPython is so slow that speeding up json.loads is a premature (and unnecessary) optimization.