TileDB: Storing massive dense and sparse multi-dimensional array data (tiledb.io)
180 points by rajnathani on Oct 26, 2017 | 72 comments



The underlying technology [1] for TileDB came out of MIT and Intel working together.

And TileDB, the company formed out of it, recently received funding [2] from Intel in Intel's latest tranche of investments.

[1] https://people.csail.mit.edu/stavrosp/papers/vldb2017/VLDB17...

[2] https://techcrunch.com/2017/10/19/data-is-the-name-of-the-ga...


Good to know private shareholders are getting even richer off all that taxpayer research money that's pumped into MIT.


Stavros from TileDB, Inc. here: This comment gives us the opportunity to explain a bit of the history of the company and our thesis. TileDB was my research project while I was working as a full-time researcher at Intel Labs and was stationed at MIT. This was part of a collaboration between Intel Labs and MIT called the Intel Science and Technology Center (ISTC) for Big Data. Intel funded MIT (and several other universities) for 5 years, while mandating that any tech produced in the center be open-source with a non-restrictive license (like our MIT license), and without any strings attached. So the ISTC was for pure research. We created TileDB, Inc. so that we could continue working on TileDB beyond the graceful termination of the ISTC, since we believed that the tech (and future vision) could significantly contribute to the scientific community (there was already some early evidence through our work with the Broad Institute on genomics). The company will continue to respect and contribute to the open-source community, while trying to bring together very talented people to transition the tech from a research project to production-ready software.


Given that this is an open-source project, Intel's involvement doesn't seem too sinister to me.


I think it was MIT's involvement the above poster was concerned about... but the fact that it is open source is indeed relevant!


For posterity's sake, it reminds me of MUMPS persistent sparse arrays, but built to scale much larger. This comment is not meant in any way to take away from this achievement, but rather to muse on where we were and where we're going.

It seems like the days when we were stuck in particular or limited ways of thinking about databases/persistent storage are finally well and truly behind us! Now we have many awesome tools to choose from; better to have more tools in the belt than fewer.


Haha, fully agree! It's good to separate MUMPS the language from MUMPS the storage layer / database which was innovative and pioneering in many ways. I don't think we need to re-hash MUMPS the language :)


Agreed! :)


Looks interesting... Here's the publication that it appears to have formed from: https://dl.acm.org/citation.cfm?id=3025117 (VLDB 2016).


Also available directly here: [PDF, VLDB 2017] https://people.csail.mit.edu/stavrosp/papers/vldb2017/VLDB17...


I would like to see some information on compression ratios. Compression can often significantly improve performance (less data to load/save, more in memory or cache), and especially dense arrays should compress easily. They mention that compression is supported, but I'm not sure if that's done internally or just applied to the files they create.


The Learn More section has a very nice description of the concepts. The compression page [0] seems to say that they group each attribute for a fragment together and then apply compression.

[0] https://docs.tiledb.io/v1.0/docs/compression


Floating points compress poorly, though.


Jake from TileDB, Inc. here: wenc and srean are right. Techniques such as those used in zfp and fpzip, which wenc mentioned, are also used to compress real-world LAS file (point cloud) datasets. For the moment we are only focused on lossless compression (scientists are paranoid about losing data), but there is definitely room to explore integration with lossy compression as well. Machine learning applications often do not need full precision, so intelligent forms of lossy compression are useful.

Another cool research application of TileDB that extends the storage library with the VP9 codec can be seen here: https://homes.cs.washington.edu/~magda/papers/haynes-sigmod1...


You can get very good lossless compression with floating point numbers; Facebook's Gorilla paper comes to mind. I usually use its delta-of-delta encoding, which provides very high compression for timestamps in time series. While that won't really help in your case, their XOR-based floating point encoding could help compress matrices quite efficiently.

http://www.vldb.org/pvldb/vol8/p1816-teller.pdf [page 5]
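
The gist of Gorilla's float scheme: XOR each value's bit pattern with the previous one; for repeated or slowly changing values most bits cancel to zero. A minimal sketch of just the XOR step (Gorilla additionally bit-packs the leading-zero/meaningful-bit counts, which this omits):

    import struct

    def xor_deltas(values):
        # XOR each float64 bit pattern with the previous value's.
        # Repeated or slowly changing values yield words that are
        # mostly zero, which the bit-packing stage encodes cheaply.
        prev, out = 0, []
        for v in values:
            bits = struct.unpack("<Q", struct.pack("<d", v))[0]
            out.append(bits ^ prev)
            prev = bits
        return out

    for word in xor_deltas([100.0, 100.0, 100.25, 100.25, 100.5]):
        print(f"{word:064b}")  # long zero runs after the first value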


There have been a few responses along the lines of 'not always' but what you say is indeed largely true.

There is another thing worth considering: the algorithms (and even the theory) that work well for compressing discrete sources are not well suited to compressing real numbers (floating point numbers aren't reals, but they are the poor man's reals). On the theory side, this bothered Claude Shannon enough that he decided to revisit it later in his career to create rate-distortion theory; he knew that there was some unfinished business in information theory.

We do have sort of a chicken-and-egg problem here, especially when we want to store a lot of floating point numbers for an ML workload. Learning how to compress and learning the underlying distribution are equivalent problems. If we have already learned the model, then yes, we could compress the data well. But when we haven't, then by definition we don't have the knowledge to do a good job of storing the data in a well-compressed form. And after we have acquired the knowledge to compress well, we don't really need the compressed data anymore to learn the model; we already have it. One way to address this would be to do both incrementally and simultaneously.


Not always [1].

Also, in many time-series applications involving sensors whose readings don't fluctuate much, process historians often apply deadband compression (i.e., store a new value only when it falls outside a certain band around the last stored one). This type of compression is lossy and sometimes a bit controversial for high-fidelity uses, but often results in efficient storage.

[1] zfp, fpzip: https://computation.llnl.gov/projects/floating-point-compres...
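
To make the deadband idea concrete, a minimal sketch (the tolerance and data are made up; real historians add time limits and interpolation on retrieval):

    def deadband(samples, tol):
        # Keep an (index, value) pair only when the value escapes the
        # +/- tol band around the last stored value. Lossy by design.
        kept, last = [], None
        for i, v in enumerate(samples):
            if last is None or abs(v - last) > tol:
                kept.append((i, v))
                last = v
        return kept

    readings = [20.0, 20.01, 19.99, 20.02, 20.6, 20.61, 20.0]
    print(deadband(readings, tol=0.1))  # [(0, 20.0), (4, 20.6), (6, 20.0)]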


Not always poorly -- especially if you preprocess by calculating running diffs of consecutive values.


> running diffs of consecutive values

https://en.wikipedia.org/wiki/Delta_encoding
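
For intuition, a toy comparison on integer timestamps, where the effect is easiest to see (exact sizes will vary with the data):

    import zlib
    import numpy as np

    # timestamps arriving roughly once per second
    ts = np.cumsum(np.random.randint(999, 1002, 100_000)).astype(np.int64)

    raw = ts.tobytes()
    delta = np.diff(ts, prepend=ts[0]).tobytes()  # running diffs: tiny ints

    print(len(zlib.compress(raw)), len(zlib.compress(delta)))
    # the diffs are nearly constant, so they compress far better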


Anyone know how it compares with Apache Arrow (which pandas is moving to)?

http://wesmckinney.com/blog/apache-arrow-pandas-internals/

http://arrow.apache.org/


Stavros from TileDB, Inc. here: Arrow employs a columnar format to store objects like data frames. TileDB is also columnar (so you can sub-select attributes and perform analytics with very similar optimizations to Arrow), but the first-class citizens are (the more general) dense and sparse multi-dimensional arrays. Moreover, TileDB focuses on optimizing for the persistent storage backend, so that it can handle out-of-core analytics on massive datasets that cannot fit in main memory (e.g., genomics), or offer the same performance using less RAM (leading to cost savings in the cloud). Nevertheless, we are quite fond of Arrow, so we hope to work together at some point and integrate as seamlessly as possible.


What do you mean by out-of-core?


In-core algorithms require the entire array(s) in main memory to perform some computation. Out-of-core algorithms are typically block-based and stream the array blocks from persistent storage to main memory on demand, working on parts of the array(s) at a time, thus minimizing the memory requirements. If this is done asynchronously and carefully, for some CPU-bound algorithms you may be able to completely hide the storage-to-memory cost, thus saving memory without losing performance.
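
A toy illustration of the block-based pattern using a memory-mapped NumPy array (assumes big_array.bin already exists on disk; a real system would also overlap I/O with compute asynchronously):

    import numpy as np

    # an 800 MB on-disk array we pretend exceeds our RAM budget
    a = np.memmap("big_array.bin", dtype=np.float64, mode="r",
                  shape=(100_000_000,))

    block = 1_000_000
    total = 0.0
    for start in range(0, a.shape[0], block):
        chunk = np.asarray(a[start:start + block])  # load one block
        total += chunk.sum()                        # compute, then discard
    print(total)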


Arrow is a columnar format. TileDB does clustering across multiple dimensions. Arrow is great when you request specific columns. TileDB is great when you want rectangular sections of the space (think getting tiles from a map).
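
A toy model of why: count the contiguous runs (roughly, seeks) a 32x32 rectangular query touches in a 1024-wide row-major layout versus a tile that happens to align with the query (sizes are made up):

    import numpy as np

    def runs(offsets):
        # number of contiguous runs a read touches on disk
        offsets = sorted(offsets)
        return 1 + sum(b - a > 1 for a, b in zip(offsets, offsets[1:]))

    rows, cols = np.mgrid[0:32, 0:32]      # a 32x32 rectangular query

    row_major = (rows * 1024 + cols).ravel()
    print(runs(row_major))                 # 32: one separated run per row

    tiled = (rows * 32 + cols).ravel()     # query falls inside one 32x32 tile
    print(runs(tiled))                     # 1: the tile is stored contiguously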


How do I slice & dice TileDB with a query? Can TileDB compute aggregations or do I need to copy cells into main program memory out of TileDB and then perform all data manipulation?


Stavros from TileDB, Inc. here: Currently, TileDB is a storage manager, so it offers only efficient slicing. Support for aggregation queries (e.g., similar to what you can do with Pandas) is on our roadmap. We plan to work on it as soon as we ship the Python bindings.


> Python, R, Matlab and Excel

Mentioning Matlab and Excel immediately puts the product in the category "they know what they are doing" as opposed to "another group of sophomores trying to reinvent data science".

I'm still waiting for a raw data dump with "avoid copies at all costs" access for very raw, very verbose vehicle and manufacturing data that in 99.9 percent of cases is never accessed, but must be analyzed when errors are detected late in the process. I.e., there's practically no transformation to apply during storage, but upon access, transformation must be done.

If TileDB is kind of like a more structured, more low-level struct oriented Redis, it is a very welcome addition.


Apart from a diminishing population of old farts and a clutch of national labs and universities that MathWorks keeps well greased, is MATLAB even relevant anymore?


Jake from TileDB, Inc. Engineering as a discipline is conservative (for good reason). The tools and the processes they use change slowly. Matlab is still entirely relevant both in the sciences and in industry. There are people with huge amounts of domain knowledge (who may only know Matlab or Excel) that are increasingly called upon to analyze and interpret larger amounts of data. These "old farts" are the people engineering, designing, and debugging our modern world. Empowering people with domain knowledge to answer data driven questions is what the democratization of data science is all about. There is tremendous value in building bridges across communities and across generations here.

I say this as someone who helped in small ways to develop open source alternatives to Matlab.


Stanislav Seltser, Petacube: Agree with Jake. Matlab still has some unique features that many open source projects (e.g., Python) don't have, such as FPGA integration and system modeling and simulation. Combine that with a decent language, a nice debugger, and many strong industry-specific solutions, and it's a good bet it will be around for a while. The reasons people dump Matlab are not that it's overpriced, but the lack of integration with big data systems and the fact that Matlab license costs become untenable at large scale. Plus, the number of industries using these unique Matlab features is relatively small.


> These "old farts" are the people engineering and designing our modern world. Empowering people with domain knowledge to answer data driven questions is what the democratization of data science is all about.

BTW I don't begrudge that TileDB has MATLAB support.

I see plenty of modern scientific/engineering workload in my day job. From what I see around myself it is usually a handful of people set in their ways that are holding others back by keeping a ridiculously overpriced tool alive that has comparable if not better alternatives. I gather from comments that it is different in Europe.


So Jake, how are the Julia bindings coming along?


It is used pretty extensively in my industry (industrial manufacturing). There has been some talk about R but not sure it will be widely adopted.

You are right that engineers are conservative which no doubt plays some role. But I'd say technical debt and legacy code plays much more of a role.

Most engineers (Chem, Mech, Matls, etc.) are not exposed to code during university. Maybe it is changing now, but it is a slow process. Often the first exposure comes when the engineer enters industry and is asked to work on an existing model, usually under the supervision of a senior engineer. You learn whatever language the senior engineer knows, and that is typically what they learnt from similar mentorship; it is often Fortran or C/C++. Our industrial process has not changed dramatically in the last 30 years, so once efficiently written and accurate simulation code exists there is very little reason to rewrite it (given the choice to rewrite an existing model in whatever the current language du jour is or continue to hack on something that already exists, most engineers, at least the ones I know, would choose to hack on the existing code). For pretty much this reason there is a heap of Fortran that is still alive in my org (with roots that can be traced back to the '80s). It's probably worth mentioning that a lot of engineers don't approach code thinking about algorithms; it's equations, i.e., I need to write something to solve Bernoulli's equation, or Ergun's equation, or similar. Linear algebra (i.e., solving simultaneous equations) is the other main reason to write code, and hey, MATLAB does this pretty well...

The other reason for being "gunshy" about new technology, at least in my org, is that a lot of the senior engineering people still remember when we got bitten by investing in the Microsoft "stack" in the late '90s. There was a lot of modelling done with VB6 and Access, which turned out to be a technology dead end. Access databases in particular have plagued our org; in many cases it has taken a multiyear effort to migrate something out of Access. Open source probably protects against this, but everyone wants to be sure any new technology will still be around in 15 years' time. I think that is why people in my org are finally starting to look at R: it has passed that initial hurdle and there is some confidence it is not just a fad...


Yeah, those deep-pocketed government lab and university researchers are clearly the dominant revenue source for MathWorks. Everybody in the corporate world is super nimble and moved on to Numpy as soon as they could.


Pervasive in the German automotive industry.


Stanislav Seltser, Petacube: _pmf: Think of distributed persistent NumPy arrays, as opposed to key-value pairs in Redis. The idea is that you will not pull the data out to do computation; you will push your code in. Pretty good for archiving due to compression.


Anyone have experience using one of these array data stores to handle large amounts of weather forecast data? At my job we've come up with our own clever Postgres solution to handle large amounts of gridded binary data. Essentially we are inserting large 3D arrays (a series of images where each pixel represents a geographic location and a value, like windspeed or temperature or rainfall) into our Postgres. Our solution is solid, but I am always on the lookout for novel new approaches.


Jake from TileDB, Inc. here: I think this would be an ideal workload for array data stores (NetCDF, a standard in this area, uses HDF5 under the hood). You have some number N of attributes that you want per grid point over time (and you want to append to the time dimension). If you are ingesting GRIB2 files then you can take advantage of compression as well. An array data store like TileDB should offer advantages for fast access, as you can get a pointer directly to the stored array and do not have to access the (serialized) data over a socket, especially if you are only interested in a subarray of the dataset.
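
As a plain-HDF5 sketch of that layout (h5py here, since the TileDB Python bindings aren't out yet; file, dataset, grid, and chunk sizes are all made up):

    import h5py
    import numpy as np

    lat, lon = 721, 1440  # hypothetical 0.25-degree global grid

    with h5py.File("wind.h5", "a") as f:
        if "windspeed" not in f:
            f.create_dataset("windspeed", shape=(0, lat, lon),
                             maxshape=(None, lat, lon),   # unbounded time axis
                             chunks=(1, 90, 180),         # tiles sized for reads
                             compression="gzip", dtype="f4")
        ds = f["windspeed"]
        field = np.random.rand(lat, lon).astype("f4")     # stand-in for a GRIB2 field
        ds.resize(ds.shape[0] + 1, axis=0)                # append along time
        ds[-1] = field

        history = ds[:, 400, 700]  # one grid point's full history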


Hi both, this is exactly something that I'm looking at doing. We've got about 10 TB of NetCDF data coming in every day and we're looking for a cost-efficient data store to provide fast access to individual grid points. S3 has proven to be too slow.

Any chance I could pick your brains about using either Postgres or TileDB?

Thanks!


Absolutely! Drop us a line at hello@tiledb.io and tell us a little more about the problem you are trying to solve and we can go from there.


Stanislav Seltser, Petacube Inc:

Ingesting data into Postgres makes sense for sparse data but not for dense data, because it wastes a lot of space storing coordinates with every data point and every weather variable. If you are using NOAA GRIB2 forecast files, those are dense. Not to mention losing compression in Postgres. TileDB will store the data compressed (the dimension coordinates themselves will be compressed), plus column storage (one column per NetCDF variable) will make retrieval of dense weather data blazingly fast, as opposed to Postgres where you would have to scan the whole table.
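
Back-of-the-envelope on that coordinate overhead (grid size and variable count are hypothetical):

    # hypothetical 0.25-degree grid, 10 weather variables, float32 values
    points, variables = 721 * 1440, 10

    dense = points * variables * 4                 # dense layout: values only
    tabular = points * variables * (3 * 8 + 4)     # one row per cell: 3 float64
                                                   # coords (lat, lon, time) + value
    print(dense / 2**20, tabular / 2**20)          # ~40 MiB vs ~277 MiB,
                                                   # before row headers and indexes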


Would it make sense to store a tilemap for a game in this? Normally I use NoSQL for that and just store the tilemap as a JSON array. I need to be able to update and retrieve single cells easily (this is for a backend in a multiplayer game).


I don't know; TileDB seems rather low-level, and many databases have geospatial indexing.


I wonder how it compares to SciDB[0] which is also used for storing multidimensional data.

[0]: https://en.wikipedia.org/wiki/SciDB


Jake from TileDB, Inc. here: In addition to the differences pointed out in the paper, I think SciDB and TileDB are very different philosophically. SciDB is architected very much like a traditional RDBMS, while TileDB is much more lightweight. SciDB encourages you to use their own query language (AQL); TileDB wants to integrate with and extend the high-level tools you already use (Python, R, etc.) with as little overhead as possible.


Stanislav Seltser, Petacube: kuwze: the following things are fundamentally different: license, software size, data model, and sparsity support.

SciDB  -- license: AGPL (Affero); software size: 5 GB;  data model: ACID;                               focus: dense;            dimensions: integer

TileDB -- license: MIT;            software size: <1 MB; data model: eventual consistency via fragments; focus: dense and sparse; dimensions: integer and float



Could I use this for deep learning research? It's common for my models to be anywhere from 100 MB to a few GB. Although I'd find it more useful for reading batched training data.


Stavros from TileDB, Inc. here: TileDB could be useful for storing your training data (in some storage backend of your choice) as well as your intermediate data (as davedx pointed out). But this would make sense only if your data are truly large and cannot fit in main memory. In fact, we are looking forward to seamlessly integrating with systems like TensorFlow, but we would rather wait until we can bring some value to applications with very large storage requirements.


Sounds ideal for storing training/intermediate data for machine learning. Niche competitor for Cassandra in this space?


Most models I've seen have at most a few hundred MB of parameters, with complex connections that can only be modeled as a set of many different 1D or 2D arrays of different sizes. Further, this DB stresses that it handles sparse data, and most ML data is not sparse.

It doesn't seem like the best application. One of their pages mentions they ingest BAM records, which are for biological sequences. I'm guessing some DNA storage applications.


> most ML data is not sparse.

This brings up a question: in what fields does one find heavy use of large, sparse matrices that need to be persisted and queryable?

In my mind, sparse matrices typically occur in the context of graphs/relationships, e.g. PageRank, logistic networks, adjacency matrices, etc. They also tend to be a property of Hessian matrices (2nd order derivative for a multivariate system). But typically these are intermediate quantities that are discarded after a computation completes.


Jake from TileDB, Inc. Genomics is a big field where sparse matrix storage is needed. Human genomes are stored as a diff off of a reference, which as you indicated forms a graph which can be represented as a sparse matrix. In other fields of genomics, such as metagenomics, fragments of DNA when analyzed also have a graph like structure.

TileDB supports both dense and sparse arrays. It was designed around the concept of handling sparse arrays, but dense arrays can be thought of as a degenerate case of sparse array storage in TileDB. For dense arrays the tile extents are contiguous and we don't materialize the coordinate values. This way all the concepts are the same and we can capture both use cases. Sparse annotations to dense array values, such as NA or null handling, can also be captured as a sparse array fragment layered over a backing dense array.

I agree with you that for most use cases, storage will be dense. But it is useful to have one system that can handle both representations efficiently, with the sparse case not added on as an afterthought (it also makes the system simpler).
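
A small illustration of the dense-as-degenerate-sparse point (a sketch, not TileDB's actual on-disk format):

    import numpy as np

    dense = np.array([[0., 0., 3.],
                      [0., 5., 0.]])

    # sparse (COO-style): only nonempty cells, with explicit coordinates
    rows, cols = np.nonzero(dense)
    cells = [(int(r), int(c), float(dense[r, c])) for r, c in zip(rows, cols)]
    print(cells)          # [(0, 2, 3.0), (1, 1, 5.0)]

    # dense: cell order is implied by the layout, so values alone suffice
    print(dense.ravel())  # coordinates never materialized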


Stavros from TileDB, Inc. here: Another area of great interest to us is point cloud data, which is essentially 3D points in a super sparse space. In fact, any application dealing with spatio-temporal data (2D/3D/4D data that tend to be sparse and skewed) can take advantage of TileDB.


> most ML data is not sparse.

As an ML guy I can say that sparse data is pretty common in ML. Text data, market basket data, graph Laplacians, adjacency matrices, and sequence fragment data are all rich in sparse matrix computation and storage operations. For a moment I thought the "not" was a typo. Lack of support for sparse matrices often becomes a serious inconvenience in a tool; very happy that the TileDB folks have given thought to the sparsity requirement from the get-go.


True. Text and high-cardinality categorical data often involve sparse data. I was really considering image and video, where it isn't really used.


This looks very interesting. Currently we store our dense simulation (and experimental) data in NetCDF/HDF5. Given correct chunking, this seems to be pretty efficient both performance- and compression-wise. What would we gain by using TileDB? How does performance compare with HDF5?


Stavros from TileDB, Inc. here: HDF5 is great software and TileDB was heavily inspired by it. HDF5 probably works great for your use case. TileDB matches HDF5's performance in the dense case, but in addition it addresses some important limitations of HDF5, which may or may not be relevant to you. These include: sparse array support (not relevant to you); multiple readers and multiple writers through thread- and process-safety (HDF5 does not have full thread-safety, and it does not support parallel writes with compression; I am assuming you are using MPI and a single writer though, so HDF5 should still work well for you); and efficient writes in a log-structured manner that enables multi-versioning and fault tolerance (HDF5 may suffer from file corruption upon error and from file fragmentation; you are probably not updating, so again not very relevant to you). Having said that, and echoing Jake's comment, we would love to hear from you about how TileDB could be adapted to serve your case better.

A general comment: TileDB’s vision goes beyond that of the HDF5 (or any scientific) format. Considering though the quantities of HDF5 data out there (and the fact that we like the software), we are thinking about building some integration with HDF5 (and NetCDF). For instance, you may be able to create a TileDB array by “pointing” to an HDF5 dataset, without unnecessarily ingesting the HDF5 files but still enjoying the TileDB API and extra features.


Jake from TileDB, Inc. here: Performance-wise, I would look at the paper referenced in this thread, which provides benchmarks for various workloads. As to what advantages TileDB may offer you, that is problem-dependent, especially compared to dense simulation output data, which is the use case HDF5 was designed for. If you have specific suggestions for ways to improve on HDF5 for your use case, we would love to hear about them.


I wish they had python bindings.


Their Python, R, Matlab, and Excel bindings are in-progress.


Looks exciting. What's the easiest way to get started with this via Python/NumPy? Looks like this is a design goal, but not currently supported.


Stavros from TileDB, Inc. here: We have elevated the Python/NumPy bindings as our top priority and have already started development. We will try to ship them asap. :)


Excellent! Good to hear :)

Also saw in your documentation that you're concentrating on lossless compression right now, which makes complete sense. However, as a scientist, I just want to put in a vote for lossy compression too: it's not uncommon to work with large datasets given in float64 (because float64 is used for the intermediate processing steps) where the actual precision we need to store is much less than that, yet we're stuck with these huge binary files.
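
In the meantime, a common half-measure is to zero out the mantissa bits you don't need before lossless compression: still float64 on disk, but it compresses far better. A sketch (keep_bits is a made-up knob, and proper rounding beats plain truncation):

    import zlib
    import numpy as np

    def trim_mantissa(x, keep_bits):
        # zero the low (52 - keep_bits) mantissa bits of float64 values;
        # lossy: keeps roughly keep_bits * 0.3 significant decimal digits
        mask = np.uint64(~((1 << (52 - keep_bits)) - 1) & (2**64 - 1))
        return (x.view(np.uint64) & mask).view(np.float64)

    x = np.random.rand(100_000)
    full = zlib.compress(x.tobytes())
    trimmed = zlib.compress(trim_mantissa(x, keep_bits=20).tobytes())
    print(len(full), len(trimmed))  # trimmed compresses much better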


The way the code is currently architected allows us to add support for pretty much any compressor (compatible with our MIT license) with minimal effort. So, please do send us suggestions about your preferred compressor and we will add it pretty quickly. Thanks!


Looks nice, still have to read the whole paper. Seems like it's most useful for sparse arrays. Maybe we will get a golang port :P


How long until we can have Julia wrappers?


Since jakebol is involved in both TileDB and Julia, this might be up and coming?


I am curious about how the query performance compares to working with JSON files in Spark for ~100GB data.


Jake from TileDB, Inc. here: Depending on the structure of the JSON files you are querying, you may be able to take advantage of columnar compression and massively reduce the dataset size (especially if the JSON files contain numeric data). Also, repeat queries will not have to re-parse the JSON files. This may speed up queries quite a lot, but it depends on the specifics of your problem.


Stanislav Seltser, Petacube: You are comparing a structured workload (array-based TileDB) to an unstructured one (JSON + Spark). Once you convert your JSON to a sparse array structure (a one-time conversion), TileDB will beat Spark + JSON by several orders of magnitude. Caveat: this assumes your Spark + JSON workload is some heavy processing, not a lightweight one.




