Hacker News
Show HN: Pydantic – Data validation using Python 3.6 type hinting (helpmanual.io)
120 points by scolvin on June 3, 2017 | 27 comments



For those looking to validate dictionaries / JSON responses in Python, the voluptuous library works quite well: http://github.com/alecthomas/voluptuous. It also works for lists and other data types.


I use a fork/successor library, good[0], for configuration validation. I've especially liked the data transformation it can do (I can easily allow a configuration entry to be a single value or a list of values, and transform it to always be a list).

[0] https://github.com/kolypto/py-good


JSON has a well defined schema system that works across languages. http://json-schema.org/

Python comes with the jsonschema module out of the box.
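A minimal example with the jsonschema package (the schema itself is just a portable dict):

```python
from jsonschema import validate, ValidationError

schema = {
    'type': 'object',
    'properties': {
        'name': {'type': 'string'},
        'age': {'type': 'integer'},
    },
    'required': ['name'],
}

validate({'name': 'sam', 'age': 38}, schema)  # passes silently

try:
    validate({'age': 'not a number'}, schema)
except ValidationError as exc:
    print(exc.message)
```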


What do you mean by 'out of the box'? jsonschema needs to be installed. It's not in the stdlib.


That's rather verbose, though. I prefer the much more functional "schema" library:

https://github.com/keleshev/schema


Thing is, none of these are language agnostic. jsonschema is the only one that makes an attempt at that.

It's a slight pity that the python jsonschema package doesn't support some of the more powerful recent features though.


Recently I had a similar idea and wrote a library that creates (and dumps) your typed NamedTuples, datetimes and similar objects from plain JSON, using type annotations:

"JSON support for named tuples, datetime and other objects, preventing ambiguity via type annotations"

https://github.com/m-click/jsontyping

If you are interested, please have a look at the first unit tests to see how it works:

https://github.com/m-click/jsontyping/blob/master/tests/test...

Note that the tests currently use the "ugly" NamedTuple syntax to be compatible with Python 3.5 and 2.7.
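The underlying idea (this is not jsontyping's actual API, just a stdlib sketch of the technique) is to walk the class's type annotations and recurse into nested annotated types:

```python
import typing

def from_json(cls, data):
    """Rebuild a NamedTuple from a plain dict, recursing into annotated
    fields. A minimal sketch only; the real library also handles
    datetimes, lists and dumping back to JSON."""
    kwargs = {}
    for name, hint in typing.get_type_hints(cls).items():
        value = data[name]
        if hasattr(hint, '_fields'):  # nested NamedTuple
            value = from_json(hint, value)
        kwargs[name] = value
    return cls(**kwargs)

class Point(typing.NamedTuple):
    x: int
    y: int

class Segment(typing.NamedTuple):
    start: Point
    end: Point

seg = from_json(Segment, {'start': {'x': 0, 'y': 0}, 'end': {'x': 3, 'y': 4}})
assert seg.end == Point(3, 4)
```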


I've recently been using attrs as an easy way to make simple datatypes, but its only gesture towards validation is an arbitrary callback per field. Hooking into Python 3 type annotations is a great idea!
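For anyone who hasn't seen it, the attrs per-field validator callback looks like this (class and field names invented):

```python
import attr

@attr.s(frozen=True)  # frozen=True gives the "faux-immutability"
class Artifact:
    href = attr.ib(validator=attr.validators.instance_of(str))
    size = attr.ib(validator=attr.validators.instance_of(int))

a = Artifact(href='https://example.com/a.tar.gz', size=1024)

try:
    Artifact(href='x', size='big')  # instance_of raises TypeError
except TypeError as exc:
    print(exc)
```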

Does/will Pydantic handle all the standard dunder fields like __eq__, __lt__, __hash__, __cmp__ and faux-immutability like namedtuple and attrs do?


"faux-immutability" is a good way to describe it. My problem is the "faux"; proper immutability is virtually impossible in python and I'm not convinced about providing partial immutability and thereby giving people a false sense of security. That said there's an issue about it: https://github.com/samuelcolvin/pydantic/issues/38. I'll consider it if we can find a performant and elegant way of using it.

__eq__ makes sense; I'll add it when I get round to it.

__hash__ would be nice but far from simple to do in a performant way.

__lt__: I'm not sure what this would mean for a model.

__cmp__ no longer exists in python 3.


Do you have a writeup/blog post on your approach? I would like to read more about it.


In the small, it's just using attrs the way it's described in its documentation.

In the large, I've been learning about Rust recently and wrapping my head around the design-patterns of static typing. For internal data-structures the benefit is not as clear, but for serialising and deserialising external data (like from config files or JSON APIs) I really prefer having specific, named types instead of a generic bucket of dicts.

API documentation can be more concise. You can say "this argument must be an instance of BuildArtifact" rather than "this argument must be a dict with an 'href' key whose value is the URL to a build artifact and a 'hash' key whose value is the SHA256 of that artifact" in every relevant API.

Debugging is easier when inspecting a variable starts with "<BuildArtifact ...>" rather than just dumping a dict at you.

If you need to operate on a particular kind of data, a named class gives you an obvious place to hang a method, instead of having a loose function rattling about. For operations between two data-types (like 'merge' or 'intersection'), a loose function might still be the most appropriate, but operations like searching or summarizing are naturally methods.
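Sketching the BuildArtifact example with attrs (the fields are taken from the description above; `matches` is a hypothetical method for illustration):

```python
import attr

@attr.s
class BuildArtifact:
    href = attr.ib()  # URL of the artifact
    hash = attr.ib()  # SHA256 of the artifact

    def matches(self, digest):
        # An obvious place to hang behaviour, instead of a loose function.
        return self.hash == digest

artifact = BuildArtifact(href='https://ci.example/build.tar.gz', hash='deadbeef')
print(artifact)  # the repr names the type and its fields, not just a bare dict
```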


I don't. I'm not sure it would be that interesting.

In short:

* I started out with a class that substituted environment variables into itself for settings, still in aiohttp-devtools: https://github.com/aio-libs/aiohttp-devtools/blob/master/aio...

* using annotations occurred to me

* it worked

* it was 50% faster than trafaret which I'd been using before

* I added some unit tests, published to pypi and started using it.


Interesting project! I'm collecting different ORM/ODM/Mappings for python in this repository: https://github.com/grundic/awesome-python-models. Added your library. Thanks!


I usually use PyComb


Without saying why, that post is rather meaningless.


Marshmallow is still the best lib in town. Indeed, most of these libs fall short when you start to use them in the real world: where validation is more than a type check, where fields depend on each other, where data is generated on the fly after validation, and where you need all of that to cascade down your nested, sometimes recursive, data structures and produce equally complex error messages.

A data validation framework is not a toy project.


Pydantic is not just a toy: I built it after using and abusing numerous other libraries and finding them wanting in one way or another.

Because it reuses Python's typing system, it should offer the most pythonic and flexible description of types possible.

I agree about the need for complex validation chains relating to numerous fields, that's already partially possible with pydantic (although not documented). I'll add support for this stuff as well as documentation over the next few weeks.
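For anyone who hasn't tried it, the basic shape of a pydantic model (invented fields):

```python
from pydantic import BaseModel, ValidationError

class User(BaseModel):
    id: int
    name: str = 'Jane Doe'  # a default makes the field optional

user = User(id='123')  # the numeric string is coerced to the annotated int
assert user.id == 123 and user.name == 'Jane Doe'

try:
    User(id='not a number')
except ValidationError as exc:
    print(exc)
```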


Having used both colander and marshmallow extensively, I prefer colander, mainly because it has first-class explicit handling of null, missing and required values, and its support for nesting and inheritance is also much nicer than marshmallow's.


I've recently been using good[0], which also allows for minor data transformation. Looking at Marshmallow, it doesn't seem to allow inline declaration of nested schemas: each level of the schema needs its own data type.

[0] https://github.com/kolypto/py-good


I agree. Marshmallow is the best game in town and I love using it. That said, it's not perfect and I'm eagerly awaiting a successor. I use it frequently and have run into a lot of cruft.


I would love to read about it.


Sure, my complaints are mostly around error handling. My comments might not be accurate since this is off the top of my head.

When a field raises an error, I've noticed the value gets replaced with a "None" type (or maybe it's removed from the passed data object) when using @validates_schema. This is annoying because I have to pass the original data and check whether the value is actually null or not. This can suck when using JSONAPI because you have to be super careful with your data extraction (e.g. data.get('data', {}).get('relationships', {})... etc.).

I would like more control over how validation is executed: the ordering of validation, and the ability to stop or continue validation at arbitrary points. Maybe in my validator I could do something like "raise ValidationError(msg, stop_validation=True)".

I would like more control over pre- and post-dump/load order, sort of like a z-index in CSS. So when using marshmallow-jsonapi, I could specify pre_loads that access the data before and after the jsonapi pre_load formatting.

I would like a better way of using "class Meta". Right now it's annoying to inherit from a base and then define an additional "class Meta". I think I ended up subclassing SchemaOpts and setting my defaults that way on my base schema.

I would like a way to replace error messages so instead of "Missing data for required field", it would say "Please specify data" (or whatever). I want to define this at the schema level too. I don't want to have to constantly define fields with the same behavior everywhere. It would be cool if Marshmallow had a dictionary of error codes and messages. So it would look like {'1': 'Missing data for required field.'} and I could override that error with "errors['1'] = msg".

If I had to sum it up. I like to write small functions and classes that have limited uses. A lot of times I feel like Marshmallow pushes me into more monolithic work so I can control the flow.


I believe marshmallow v3, which is not yet released, claims to solve the error-handling / exception-swallowing issue.

I use marshmallow all the time and love it - I agree that error handling is its weakest area, even using strict mode.


Thanks that was very interesting.


I wish the people who design APIs would constrain themselves to data structures that can be validated as types. I rarely see insane data structures in APIs built in static languages, and I say this because my company has some absurd APIs that would not be so bad if the developers had been constrained a bit.


You can't. You need permission validation, data integrity checks, time-sensitivity checks, caching, etc. for any decent CRUD app nowadays.


What about good[0]?

[0] https://github.com/kolypto/py-good



