Something else to note here is that the Python interpreter is slow compared to native code, so if you're parsing a lot of small JSON files, whatever difference in speed the JSON parser makes is masked somewhat by the interpreter overhead (assuming you're doing something with that JSON you just parsed). When you're using a JIT like PyPy, the overhead of the rest of your code shrinks, and the bottleneck can shift to something like JSON parsing, I/O, etc.
This is the main bottleneck in pysimdjson[1]. 95-99%[2] of the time is spent creating Python objects rather than parsing the documents.
There was no trick I could find to reduce the cost of bringing the document into Python land (although treating all strings as bytes instead of decoding them with PyUnicode_DecodeUTF8/unicode_decode_utf8 was a fairly large gain on string-heavy documents), so I added a quick hack to push the filtering down into C, since most of my use cases only need a small part of the full document.
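Roughly, the idea is that you parse once and only pull the piece you need across into Python, something like the sketch below (method names here are from memory of the proxy API and may differ from the current release, so check the docs before copying):

    import simdjson

    parser = simdjson.Parser()

    # Illustrative payload; imagine "events" holding thousands of entries.
    raw = b'{"user": {"id": 42, "name": "ada"}, "events": [{"t": 1}, {"t": 2}]}'

    # parse() does the parsing natively but hands back a lazy proxy: Python
    # objects are only created for the pieces you actually touch.
    doc = parser.parse(raw)

    # Pull out just the part you need via a JSON Pointer instead of
    # materializing the whole tree (which is what json.loads, or calling
    # doc.as_dict(), would do).
    user_id = doc.at_pointer('/user/id')  # -> 42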
I thought a bit about this - I think you could go a lot faster if you had a 'bulk' method to create lots of Python objects in one hit. I don't know if that's feasible, as I don't know much about the innards of Python, but there's no reason one couldn't keep goofing around with SIMD code and branch-free stuff all the way until you've created the objects.
This may be ferociously unmaintainable (i.e. depend on things that Python implementations don't advertise as a part of the API) but I don't know. I doubt you're the only one to have this problem.
Like you said, it would unfortunately be unmaintainable, requiring a fork of CPython. Aggressive optimizations like this could go into pypy, but IMO CPython is still trying to be a "reasonably readable" reference implementation.
I think the time is currently best spent trying to push more of the work back into native code, extending the simple filtering that's in there now to also handle mutation (similar to jq), reducing the amount of back-and-forth with the interpreter.
This whole discussion makes me think that perhaps there would be a point in parsing JSON into a "JSON database" object instead of a tree of individual hashmaps. In theory it would allow for a more memory-efficient representation and faster search or transformation operations.
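A hypothetical sketch of what I mean (JSONDatabase, get, and the pointer syntax are all made up; a real version would keep the parsed data in a compact native structure rather than a Python dict):

    import json

    # Hypothetical sketch of the idea, not an existing library: parse once,
    # keep the document in one place, and only materialize what a query touches.
    class JSONDatabase:
        def __init__(self, raw: bytes):
            # A real implementation would hand the bytes to native code
            # (e.g. a simdjson DOM) and keep them there; this just fakes
            # the interface on top of json.loads.
            self._doc = json.loads(raw)

        def get(self, pointer: str):
            """Return only the value at a JSON-Pointer-ish path, e.g. '/user/id'."""
            node = self._doc
            for part in pointer.strip('/').split('/'):
                node = node[int(part)] if isinstance(node, list) else node[part]
            return node

    db = JSONDatabase(b'{"user": {"id": 42}, "events": [{"t": 1}]}')
    db.get('/user/id')      # 42
    db.get('/events/0/t')   # 1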
> so if you're parsing a lot of small JSON files, whatever difference in speed the JSON parser makes is masked somewhat by the interpreter overhead
Does this vary based on the size/quantity of the input JSON files? I would expect it to hold regardless of the size/quantity of the input. Further, isn't CPython's json implementation already native code? If so, I would expect the bottleneck to be allocating the output objects, or pressure on the GC, or something like that?
Of course, the smaller the JSON file, the shorter the amount of time spent in native code, and proportionally more time will be spent in the interpreter around whatever looping construct is used to feed data into json.loads (and presumably process that data).
    import json

    # interpreter: time spent here looping
    for file_name in file_names:
        # interpreter: do stuff with the file name, load the file into a string, etc.
        with open(file_name) as f:
            str_data = f.read()
        # interpreter: dispatch the call to a native function
        # native: actually perform json.loads
        data = json.loads(str_data)
        # interpreter: time spent processing the data
        process_data(data)

    # what happens as native time goes to zero (json parse time decreases)?
    # answer: interpreter time dominates
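If you want to see where the split actually falls for your own workload, a crude timer like this does the job (file_names and process_data are stand-ins for whatever you're actually doing):

    import json
    import time

    def split_timing(file_names, process_data):
        # Rough measurement of how much of the loop is actually json.loads
        # versus everything around it.
        parse_time = 0.0
        total_start = time.perf_counter()
        for file_name in file_names:
            with open(file_name) as f:
                str_data = f.read()
            t0 = time.perf_counter()
            data = json.loads(str_data)           # the native "parse" bit
            parse_time += time.perf_counter() - t0
            process_data(data)                    # interpreter-side work
        total_time = time.perf_counter() - total_start
        print(f"json.loads: {parse_time:.3f}s of {total_time:.3f}s "
              f"({100 * parse_time / total_time:.1f}%)")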
So "interpreter time" is defined as "time not spent parsing JSON" and thus "the less time you spend parsing JSON the less time you spend parsing JSON"? Of course that's correct, but it's so obvious as to be unsubstantial. Surely the OP is making a more interesting point...
No, that's it - for most json documents, the time the interpreter spends in all the stuff surrounding (and executing) json.loads dominates the time spent in native code parsing. That's why replacing json.loads with a faster native implementation doesn't result in noticeable improvements - it's speeding up the fastest bit, and if it's already 0.1% of the wall clock time, then you can't get much more than about 0.1% faster.
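It's just Amdahl's law; a quick sanity check (the 0.1% figure is only for illustration):

    def overall_speedup(parse_fraction, parser_speedup):
        # Amdahl's law: overall speedup when only the parsing fraction
        # of the total time gets faster.
        return 1 / ((1 - parse_fraction) + parse_fraction / parser_speedup)

    overall_speedup(0.001, 10)    # ~1.0009x with a 10x faster parser
    overall_speedup(0.001, 1e9)   # ~1.001x even with an (almost) free parser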
PyPy is a JIT-compiled implementation though, so the interpreted bits are much faster. The OP is making the point that Python 3/CPython is so slow that speeding up json.loads is a premature (and unnecessary) optimization.