I'd heard of Redis -- it looked like overkill, and just understanding it looked like more work than what I wrote.
I'd never looked at memcache.
With your mention of these two, I did a little Google search. Google shows some of their little nutshell descriptions right away and also a link to an informative article:
https://www.linkedin.com/pulse/memcached-vs-redis-which-one-...
This article seems to say that Redis now has the memcache functionality built in, implying that one never needs to use both.
The article does seem to claim that memcache is multi-threaded and Redis, single threaded.
I've been hoping I could get by with single threading and the TCP/IP FIFO queue -- IIRC .Net has multi-threaded versions of its collection classes.
Or, maybe if I want multi-threading, better just to run that many instances of the software, each single threaded, and f'get about the overhead of the multi-threaded collection classes?
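For reference, the multi-threaded collection classes I half remember are in System.Collections.Concurrent. A minimal sketch of the sort of thing I mean, with purely illustrative names and keys:

    Imports System.Collections.Concurrent

    Module ConcurrentStoreSketch
        ' Thread-safe without explicit locks; many threads can read and write.
        Private ReadOnly Store As New ConcurrentDictionary(Of String, String)()

        Sub Main()
            ' TryAdd returns False if the key already exists.
            Store.TryAdd("session:abc", "serialized session state")

            Dim value As String = Nothing
            If Store.TryGetValue("session:abc", value) Then
                Console.WriteLine(value)
            End If

            ' Remove a pair; the removed value is ignored here.
            Dim removed As String = Nothing
            Store.TryRemove("session:abc", removed)
        End Sub
    End Module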
Again it looks like memcache and Redis are more complicated just to use than what I wrote.
"You just wrote a slower memcache with fewer features."
Well, almost anything will have more features than I included!
What I wrote should be plenty fast internally if the .Net collection classes I used are based on AVL trees, Red-Black trees, or something similar. Then what I wrote would be slower only because clients get to it via TCP/IP sockets instead of some more direct addressing that would require being on the same machine.
But I didn't really want my session state store to have to be on the same machine as the Web server software that is using it. So, that essentially implies that a Web server has to get to the session state key-value store via TCP/IP sockets and stands to be about as fast as anything else with such socket access.
Let's see: For some much heavier loads with lots of Web servers and lots of session state store key-value reads/writes, start up 20 such session state stores and have each key start with a number 1-20 to specify which key-value store. Then do a crude load leveling -- for a new key-value pair, send it to 1 plus (the last number used modulo 20), that is, round-robin. So, that would be my approach to something scalable, distributed -- IIRC, sharding.
Yes, could do more: could have an instance of the session state store start a shutdown by accepting no new key-value pairs and then, after the time out interval, with all its pairs deleted, send a message somewhere saying it is shutting down. A Web server trying to send a new pair to such a shutting down server would receive back a going out of business message and not bother that server again until it has received another message. Or, when intending a shutdown of a session state server, just send a notification to each Web server that the particular session state server was shutting down. So, could have some system management features for better continual operation. Maybe Redis has all this already, and the details of such features are what I don't want to take time to read just now!
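A little sketch of that round-robin key assignment, assuming 20 stores and the store number as a key prefix -- purely illustrative, not my actual code:

    Module RoundRobinSketch
        Private Const StoreCount As Integer = 20
        Private LastStore As Integer = 0    ' 0-based index of the last store used

        ' Pick the next store round-robin and prefix the key with its 1-20 number.
        ' Single threaded on purpose, so no locking is needed.
        Function AssignKey(baseKey As String) As String
            LastStore = (LastStore + 1) Mod StoreCount
            Return (LastStore + 1).ToString() & ":" & baseKey
        End Function

        Sub Main()
            Console.WriteLine(AssignKey("session-abc"))   ' 2:session-abc
            Console.WriteLine(AssignKey("session-def"))   ' 3:session-def
        End Sub
    End Module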
Sounds like if my startup is successful, I'll be converting over to Redis!
yes, redis is single threaded. The author's stated position is that if you want to use more cores, you run more instances of redis on the target machine (but then you have a scheme to map keys to running redis instances as if they were multiple servers). The point of keeping it single threaded is that it makes reasoning about the data guarantees simple. You're guaranteed that no two redis commands are in flight at the same time.
serializing an object to an opaque run of bytes, writing it to redis with SET, and getting it back with GET is, like ... the majority use case of redis. It's basically just a big hash table with a nice networking layer, that's all it is. It'll do a few hundred thousand operations per second if all you're doing is getting and setting opaque values. You don't have to use the other features, and their existence doesn't cost you anything, but you can have, for example, a hash table of strings, or a list of strings, or a skiplist of strings, or a set of strings, and those things all come in handy. For example, if you have a set, you can just add elements to it, and if you want to know if the set has something, you just send the value and it gives you a bool back; you don't have to read the entire set to check it for one value.
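as a sketch, that majority use case from .Net might look like this, assuming the StackExchange.Redis client library and a redis server on localhost (the key names are just illustrative):

    Imports StackExchange.Redis

    Module RedisSketch
        Sub Main()
            ' One multiplexer per process is the usual pattern.
            Dim redis As ConnectionMultiplexer = ConnectionMultiplexer.Connect("localhost")
            Dim db As IDatabase = redis.GetDatabase()

            ' The majority use case: opaque bytes in, opaque bytes out.
            db.StringSet("session:abc", "serialized session state")
            Dim state As String = db.StringGet("session:abc")

            ' A set: membership test without reading the whole set back.
            db.SetAdd("active-users", "user-42")
            Dim isActive As Boolean = db.SetContains("active-users", "user-42")

            Console.WriteLine(state & " " & isActive)
        End Sub
    End Module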
yeah, you're describing a very naive implementation of modulo sharding. That's a pretty old technique, although you wouldn't generally put the shard index into the key. If you do that, you're making it impossible to move the value, because you'd have to change the key. `foo1` and `foo2` are different keys; you have to be able to move the value without changing the key. So generally you wouldn't put the offset in the key itself, you would take a hash of the key and modulo the hash to find the index of the node containing that key.
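a sketch of what I mean, with FNV-1a standing in as a stable hash (redis cluster itself uses CRC16; any stable hash works for illustration):

    Module HashShardSketch
        ' FNV-1a: a simple hash that is stable across machines and processes
        ' (String.GetHashCode is not guaranteed to be).
        Function Fnv1a(key As String) As UInteger
            Dim hash As UInteger = 2166136261UI
            For Each ch As Char In key
                hash = hash Xor CUInt(AscW(ch))
                ' Multiply in 64 bits to avoid VB's overflow check; keep low 32 bits.
                hash = CUInt((CULng(hash) * 16777619UL) And &HFFFFFFFFUL)
            Next
            Return hash
        End Function

        ' The key itself never changes; only the node index is derived from it.
        Function NodeFor(key As String, nodeCount As Integer) As Integer
            Return CInt(Fnv1a(key) Mod CUInt(nodeCount))
        End Function

        Sub Main()
            Console.WriteLine(NodeFor("session:abc", 20))
        End Sub
    End Module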
But anyway, partitioning the data that way is easy. Nobody is contesting that writing it that way once is easy. The problem occurs when you want to add or remove servers. Re-partitioning a live dataset is much harder; adding and removing a node means having to reassign and move a bunch of keys. Unless you build some sort of query-routing proxy, you're also requiring the clients to be aware of the cluster topology, and you have to coordinate cluster changes with the clients.
TCP buffer as command queue is hilariously idiosyncratic; I've never given it much thought. Dunno what the failure modes are there. TCP flow control should dictate that the sender stop sending data, but I don't know what that means practically; it might cause the sender to block until the queue clears.
Your points on the topology of the server architecture are well taken.
My idea for taking a session state server off line has been just (1) to send it a message to decline new pairs and (2) to wait until all its key-value pairs have timed out and then just shut it down. But there's a flaw here: an actual server farm system management operation ends up depending on a variable part of the Web site logic, that is, the time out intervals. That's not so good.
But other ways of session state server shutdown would just about have to be able to move key-value pairs from the server being shut down to another key-value server. Maybe easier would be just to replace each session state server with two essentially identical ones. Then, if I don't have a proxy (itself a single point of failure?), have each client send everything to both of the session state servers. In a serious database environment, there would have to be commit logic, etc. Hmm. Otherwise, when I want to shut down a session state server, just shut down one of a pair.
To bring new pairs on-line, just send all the clients a table of all the session state servers they could use. Lock the table, change it, and unlock it? Or just have the table accessed via a pointer and, in an atomic operation, change the pointer to the new table? Then, to delete the old table, have to wait until it is no longer in use? Hmm ... concurrency issues, which I'm trying to avoid! Hmm, put the new table on some server and then send all the clients a message to go to the server and get their copy of the new table -- a single threaded solution!
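In .Net the pointer version is simple, since swapping a reference is a single atomic operation. A sketch, with an illustrative table type:

    Imports System.Threading

    Module TableSwapSketch
        ' Readers see either the old table or the new one, never a mix,
        ' because replacing a reference is one atomic operation.
        Private ServerTable As String() = {"state1.example", "state2.example"}

        Sub PublishNewTable(newTable As String())
            ' Volatile.Write makes the new table visible to all reader threads.
            Volatile.Write(ServerTable, newTable)
            ' The old table is simply garbage collected once no reader holds it,
            ' so there is nothing to wait for before "deleting" it.
        End Sub

        Function CurrentTable() As String()
            Return Volatile.Read(ServerTable)
        End Function
    End Module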
Moving key-value pairs from one key-value server to another sounds to me like putting Cadillac-grade, filtered-air, humidity-controlled air conditioning in an open Jeep! Wow. My startup isn't for stock trading, banking, or health care! If a key-value store fails, then on some Web servers there will be some TCP/IP session time outs, and some Web site users will get some messages with apologies and some cartoon of a mess up.
Hashing could be faster than AVL trees. There is a tweak to hashing in Ronald Fagin, Jürg Nievergelt, Nicholas Pippenger, H. Raymond Strong, 'Extendible hashing—a fast access method for dynamic files', ACM Transactions on Database Systems, ISSN 0362-5915, Volume 4, Issue 3, September 1979, pages 315-344, that has a graceful way around collisions. We used that in some work at IBM's Watson lab for their Resource Object Data Manager (RODM) shipped with NetView.
For one of the collection classes I'm using, for detecting time outs, the key is a time-date stamp, and I want a fast way to read the pairs in ascending order on time-date stamp, starting with the oldest. So, mere hashing would not offer that. Ah, I fell in love with AVL trees when I first read about them in Knuth, The Art .... I'm just hoping the .Net class is based on AVL trees or something equivalent.
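For what it's worth, .Net's SortedDictionary is documented as a binary search tree with O(log n) retrieval (a red-black tree under the hood, so equivalent in spirit to AVL), and it enumerates in ascending key order. A sketch of the time out scan, with illustrative names:

    Module TimeoutScanSketch
        ' Key: expiry time-date stamp; value: the session key expiring then.
        ' Real code would need a list per stamp, since two sessions could
        ' expire on exactly the same tick and dictionary keys are unique.
        Private ReadOnly Expiries As New SortedDictionary(Of Date, String)()

        ' Enumeration runs in ascending key order, oldest stamp first,
        ' so the scan can stop at the first pair that has not yet expired.
        Sub RemoveExpired(currentTime As Date)
            Dim expired As New List(Of Date)()
            For Each pair In Expiries
                If pair.Key > currentTime Then Exit For
                expired.Add(pair.Key)
            Next
            ' Remove outside the enumeration; removing during it is not allowed.
            For Each stamp In expired
                Expiries.Remove(stamp)
            Next
        End Sub
    End Module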
Somewhere between the user's fingers and the most distant parts of the server farm there have to be some flow controls of some kind. So, if the TCP/IP input queue to a session state server fills, then likely, right, the Web server making the request will block, that is, wait, maybe time out. So, that would be one source of flow control.
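And I can bound that blocking rather than wait indefinitely: .Net's TcpClient exposes send/receive timeouts. A sketch, with an illustrative host, port, and command format:

    Imports System.IO
    Imports System.Net.Sockets
    Imports System.Text

    Module ClientTimeoutSketch
        Sub Main()
            Using client As New TcpClient("state-server.example", 9000)
                ' If the store's input queue fills and TCP flow control stalls
                ' the send, fail after 2 seconds instead of blocking forever.
                client.SendTimeout = 2000       ' milliseconds
                client.ReceiveTimeout = 2000

                Dim request() As Byte = Encoding.UTF8.GetBytes("GET session:abc" & vbLf)
                Try
                    client.GetStream().Write(request, 0, request.Length)
                Catch ex As IOException
                    ' Timed out: treat this store as overloaded and back off.
                    Console.WriteLine("send timed out: " & ex.Message)
                End Try
            End Using
        End Sub
    End Module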
Maybe I should think about implementing more.
There's an old approach to flow control: just don't respond to the user so fast. Instead, make the user wait a little. That approach, in effect, slows the rate of data input to the whole Web site.
Maybe I should think, at least for the future, say, version 2.0, about actually designed means of flow control, rate limiting, etc.
What do people do about that, say, at high end sites?
> wait until all its key-value pairs have timed out and then just shut it down
you can do it that way but it often winds up making the server's shutdown dependent on client behavior. For example, the liveness of a key is determined by the client in most applications, not the server. Any sort of naive TTL system where a read on a key extends its lifetime means that the server cannot shut down without permission from the clients, because a client can just keep reading the same key over and over, extending its lifetime until the client shuts down. I have a system that works this way now; I'm working on removing that behavior. When I want to take a server offline, I have no idea if it will take ten minutes or four hours to drain. It's very annoying.
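one common fix is pairing the sliding timeout with an absolute cap, so reads can extend a key's life but never past a hard limit. a sketch, with illustrative names and intervals:

    Module TtlCapSketch
        Private Class Entry
            Public Value As String
            Public CreatedAt As Date      ' fixed at creation, never extended
            Public LastReadAt As Date     ' slides forward on every read
        End Class

        Private ReadOnly SlidingTtl As TimeSpan = TimeSpan.FromHours(1)
        Private ReadOnly MaxLifetime As TimeSpan = TimeSpan.FromHours(8)

        ' Expire on idleness OR on the hard cap, so a client re-reading
        ' a key over and over cannot keep it alive forever.
        Private Function IsExpired(e As Entry, currentTime As Date) As Boolean
            Return currentTime - e.LastReadAt > SlidingTtl OrElse
                   currentTime - e.CreatedAt > MaxLifetime
        End Function
    End Module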
> have each client send everything to both of the session state servers
generally a client would write to just one server (server A), and that server would write to the other server (server B). If server A applies the change locally and then confirms it with the client, then syncs the change to server B independently of the client request, that's asynchronous replication. If server A waits until server B has applied the change locally before confirming the write with the client, that's synchronous replication. If the client can only send writes to one of the servers, but can send reads to either server, that's master-replica replication. The easiest thing to reason about is synchronous master-replica systems: client sends a request to A, A applies it to A's storage, A sends it to B, B applies it to B's storage, B confirms it with A, A confirms it with the client. If a client can send a write to either server, that's master-master replication. Master-master replication is harder to implement; what if two different clients write the same key on two different servers at the same time? A conflict arises.
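as a sketch of that synchronous write path on server A (pure illustration with stubbed-out helpers, not any real system's code):

    Module SyncReplicationSketch
        ' Synchronous master-replica write path: the client's write is
        ' confirmed only after both A and B have applied it.
        Function HandleClientWrite(key As String, value As String) As Boolean
            ApplyToLocalStorage(key, value)              ' A applies locally
            If Not SendToReplicaAndWait(key, value) Then
                Return False                             ' B never confirmed; surface the failure
            End If
            Return True                                  ' now confirm with the client
        End Function

        Sub ApplyToLocalStorage(key As String, value As String)
            ' ... write into server A's own store ...
        End Sub

        Function SendToReplicaAndWait(key As String, value As String) As Boolean
            ' ... send to server B, block until B applies and acknowledges ...
            Return True
        End Function
    End Module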
Concurrency issues are unavoidable in web systems. Uncoordinated users are using the system simultaneously; "everybody wait in line" is not a realistic option in practice.
> What do people do about that, say, at high end sites?
that varies pretty widely. Depends a lot on what you're building, how much money you have, how much time you have, how many users you have, how many engineers you have, how easy it is for you to hire, what skills your current engineering team has, what clients will be accessing the data, etc. The permutations of concerns are innumerable. Lots of people buy hosted redis from redislabs.com or AWS Elasticache because they are either unwilling or unable to run a redis server themselves. That's a great solution for most people most of the time; it works great for a lot of use cases, it's relatively inexpensive, and the providers give you useful tools that you would otherwise have to build yourself.
> Maybe I should think about implementing more.
Are you selling a key-value store? What is your startup?
Maybe you should think about implementing less, and focus on building the things that actually matter to users. Key-value stores are nice engineering problems, but they're implementation details: users don't know or care about their existence, and their existence does not drive revenue. And that's why most people just buy hosted redis, download a redis client library, and call it a day. It's not technically the best solution, but it's oftentimes the most economical one.
Of course, you are fully correct about waiting until all the users have timed out. Even if I set the time out to 1 hour of inactivity, a user could do something every 10 minutes for days, as you mentioned.
The only things I would consider adding to my key-value Web session state store would be some little system management commands/queries. I have some now. They are dirt simple to implement -- we're talking no more than an easy day. I wouldn't be trying to implement much. I'm learning: in time I'll want a graceful, fast, high performance, and reliable way to move key-value pairs from one server to another. If Redis or some such has that, then I won't program it -- buy, not build.
Early on I have been intending to do nothing special about hardware redundancy for on-line reliability. I suspect I can get the equipment and software needed for the on-line work to run fine for weeks at a time. That should be enough for alpha test, beta test, version 1.0, and some ad revenue. Then I'll take some simple approaches to hardware redundancy and keep growing. Maybe a Cisco box between the Internet and my Web servers could help bring Web servers on/off line. And, sure, some uninterruptible power supply boxes for the computers, network equipment, lights, etc.!
> Are you selling a key-value store?
Heck no!
> What is your startup?
Might as well say -- close enough to the alpha test which I intend to announce here:
It's a new and very different Web content search engine for -- my guess -- the 2/3rds of the content on the Web, of the searches people want to do, and of the results they want to find that are currently served at best poorly. It's all safe for work, families, and children, uses no cookies, has no logins, and likely meets the EU GDPR standards.
It makes no use of old data specific to a user: That is, if two users give the same inputs on the same day, then they will get the same results (assuming I do system updates at night!). They may not get the same ads.
There is no use of key words.
It gets new data via an interactive, iterative dialog. In this way, the results are highly personalized. Users get to drill down, focus in, filter out, zoom in. Actually it's more about discovery and recommendation than search.
Or, roughly, key word search requires a user looking for Web content (i) to know what they want, (ii) to know that it exists, and (iii) to have key words that accurately enough characterize that content. I suspect that now what users are getting is often some "Top 40" version based on popularity.
When a user has (i)-(iii), key words can work well. If what they want is really popular, then they can do a sloppy key word search and still get what they want.
But IMHO that covers only about 1/3rd of the content, searches, and results of interest.
Some people at Battelle knew some such things long ago.
Key words are especially poor on still images, video clips, and recorded music. All that aside, in the 2/3rds, what a user really wants is the content with the meaning they have in mind. Getting at meaning has been challenging: Key word search has a tough time getting at meaning even for content based on text.
To be clear, in the 2/3rds, the user is looking for something they have never seen before and don't even know exists. So, they need recommendation and discovery of content with the meaning they want.
The key to the whole thing is some applied math I derived, complete with theorems and proofs, to process the data. The applied math is based on some advanced pure math prerequisites. The data manipulations are not heuristic, data science, machine learning, neural anything, or artificial intelligence and, really, not in computer science. Instead, the core is some math -- of course the users won't be aware of anything mathematical.
I had the code running on Windows XP, but then that motherboard quit. I didn't lose any code. Now I'm bringing up the code, for an alpha test, on the box I plugged together, an AMD FX-8350 at 4.0 GHz with Windows 7 64 bit Professional. The code is about 24,000 programming language statements in about 100,000 lines of typing. Yup, it's all in VB.NET -- the language looked fine to me.
Likely in time I should convert to some version of Windows Server.
Now, if I can get simple file sharing not to make a mess out of archive bits, it will help.