C++ containers that save memory and time (google-opensource.blogspot.com)
231 points by dsr12 on Feb 3, 2013 | 111 comments



Facebook released their C++ containers last year as well, which they named Folly. An interesting feature they added is that their vector class is "aware" of the memory allocator; i.e., if jemalloc is available, then the vector class will attempt to resize all dynamic memory structures to fit within cache hierarchy sizes.

http://news.ycombinator.com/item?id=4059356


Reading C++ code always gives me a weird feeling that something is inherently wrong with the language. Header files containing nearly all of the implementation, because that's the only place where you can define templates, are so odd... (https://code.google.com/p/cpp-btree/source/browse/btree.h)


Comments like this one don't help either:

    // Inside a btree method, if we just call swap(), it will choose the
    // btree::swap method, which we don't want. And we can't say ::swap
    // because then MSVC won't pickup any std::swap() implementations. We
    // can't just use std::swap() directly because then we don't get the
    // specialization for types outside the std namespace. So the solution
    // is to have a special swap helper function whose name doesn't
    // collide with other swap functions defined by the btree classes.


I don't get what's wrong. They already have an implementation of swap in their namespace that they don't want to use, and the standard one isn't specialized to work with their btree, so they use a differently named swap in their methods.

Sounds more like this should be btree::swap and whatever they have as swap() should be something else.


I think the syntax gymnastics needed to achieve the result are quite complex for what they aim for. There's nothing technically wrong with it. It's just that choosing a function from a namespace shouldn't need a 7 line comment about how and why in my opinion. It's a single statement function.

Additionally `using std::swap; swap();` not being exactly equivalent to `std::swap();` is rather strange (even if it can be explained). A candidate for code/design smell in my opinion.


> I think the syntax gymnastics needed to achieve the result are quite complex for what they aim for. There's nothing technically wrong with it. It's just that choosing a function from a namespace shouldn't need a 7 line comment...

It's probably important to point out that your last sentence above is not correct. That's not what this code is doing. If all it needed to do was call swap from a particular namespace, it would be as simple as you imply. It's trying to maximize the caller's ability to specialize swap for the type he's using in the template, while still allowing for std::swap to be used as a fallback.

> Additionally `using std::swap; swap();` not being exactly equivalent to `std::swap();` is rather strange (even if it can be explained).

One says "use standard swap". One says "bring std::swap into this namespace and choose the best swap". I agree with you the difference is subtle and non-obvious. But it is important when doing sufficiently generic programming.


There is nothing wrong with that code. The comment is strange, but the code is fine. The 'swap' call has to be resolved via ADL (argument-dependent lookup), and this is how it's done correctly. It's unfortunate that the C++ Standard essentially requires swap to be called this way (i.e.: using std::swap; swap(a, b);). In your code it's best factored out into a separate 'swap_helper' function, and that's what's done there.
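
For what it's worth, the idiom boils down to something like this (a minimal sketch with illustrative names, not the actual cpp-btree code):

    #include <utility>  // std::swap
    #include <vector>

    namespace btree_ns {
    // Calling plain swap() inside a btree method would find btree_ns::swap first,
    // and calling std::swap() directly would miss user-provided overloads.
    // The "using std::swap" two-step lets ADL pick the best overload,
    // with std::swap as the fallback.
    template <typename T>
    void swap_helper(T &a, T &b) {
      using std::swap;
      swap(a, b);
    }
    }  // namespace btree_ns

    int main() {
      std::vector<int> a{1, 2}, b{3};
      btree_ns::swap_helper(a, b);  // resolves to std::swap for std::vector
    }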

C++ is convoluted. Sometimes code that seems quite wrong has a good reason to exist.


That helps a lot. This is a pretty complicated nuance in the implementation that you rarely find in libraries in any language.


I would suggest that if you have this feeling, you don't have a good grasp of metaprogramming. This link might help: http://www.parashift.com/c++-faq-lite/templates-defn-vs-decl...

Also, the use of the extern keyword for templates, which is valid according to the standard, is limited by compiler compliance. Kind of a chicken-and-egg problem, not a language problem.

Finally, you might investigate the extern template keyword in C++11: http://en.wikipedia.org/wiki/C++11#Extern_template
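
A quick sketch of what extern template buys you (illustrative names, assuming a header shared by many translation units):

    // widget.h
    template <typename T>
    T twice(T x) { return x + x; }

    // Every TU that includes widget.h is told not to instantiate twice<int> itself...
    extern template int twice<int>(int);

    // widget.cc
    // ...because this single explicit instantiation provides it once,
    // cutting redundant compile work and object-file bloat.
    template int twice<int>(int);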

I have always wondered whether a grand rethinking of compilers/linkers wasn't in order given the radically different face of computing and computational resources available now as opposed to 30+ years ago.


Since you mention the FAQ, I just have to mention the FQA: http://www.yosefk.com/c++fqa/templates.html#fqa-35.12

More seriously, the primary use of templates is not metaprogramming, but good old parametric polymorphism (or «genericity»). They are more capable than that, but only by accident. https://en.wikipedia.org/wiki/Turing_tarpit

If you want actual metaprogramming, look at Haskell combinator libraries, Lisp macros, and external DSLs (in ascending order of "meta").


> I would suggest if you have this feeling that you don't have a good grasp on metaprogramming.

I don't think someone has to be an expert on metaprogramming to recognize the flaws in C++ (although I do think it helps).


Or know C++ to be an expert in metaprogramming.


Thanks to Comeau C++'s standard compliance, you can put the template into the .cc files. ;)


I think the complaint is more about the C++ compiler needing to expand the template every place it's used, as opposed to Java/C# generics and OCaml functors (functions taking modules as arguments and returning modules as values) which necessitate a pointer indirection at runtime (which a VM/JIT/smart linker may or may not be sophisticated enough to optimize away) but allow the call site to be ignorant of the implementation details at compile time.

There are workarounds in C++, but they're less elegant.

Arguably, the decision to keep using the C linker and use name mangling and template expansion in the compiler instead of making the linker smarter (or adding a phase between compilation/assembly and linking) was a mistake. Some kind of object file pre-processor that performed name mangling and template expansion on object files probably would have still been necessary, at least in the early years. Certainly incremental compilation and linking would be easier if template expansion were performed on object code (or a low-level intermediate representation) rather than source code.


> Arguably, the decision to keep using the C linker and use name mangling and template expansion in the compiler instead of making the linker smarter (or adding a phase between compilation/assembly and linking) was a mistake.

I don't know that it was a mistake per se, but this template compilation overhead certainly seems to be an inevitable consequence.

> Certainly incremental compilation and linking would be easier if template expansion were performed on object code (or a low-level intermediate representation) rather than source code.

This is effectively what happens... it just happens at compile time instead of link time. Given this template:

    // bar.hpp
    template <typename foo>
    foo bar(foo x) { return x; }
a compiler is asked to compile:

    #include "bar.hpp"  // source code is parsed and an intermediate representation is generated by the compiler

    int some_int = 1;
    float some_y = 2.0f;
    int x = bar(some_int);   // performs template expansion on that intermediate representation
    float y = bar(some_y);   // performs template expansion again, this time for float
    //bar.hpp's source is examined only once!
This is why you can't do some things in templates that you can do with say, C macros (and vice-versa).


Actually, C#'s generics don't require a pointer indirection. It works almost like a hybrid between C++'s templates and Java's Generics.


Even the EDG guys who wrote that support for the export keyword feel like it was a really, really bad thing: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2003/n142...


The template export feature was removed in C++11 (though the keyword itself is still reserved for potential future use).


I think Sun's compiler also was compliant.


Yes, it is a flawed language that has the declaration and the definition all in one place together. What other language does this aside from Ruby, Python, Perl, JavaScript, Java, C#, Smalltalk, VisualBasic, PHP, Haskell, Lua...


None of these other languages requires code to be compiled each and every time it is imported/included.


That's really an implementation issue, not a language issue. A C++ interpreter (http://root.cern.ch/drupal/content/cint) for example, doesn't process the includes any more than most of these others. Most modern C++ compilers of course have to parse the file, but do not actually compile code unless it is used. The more sophisticated ones can usually cleverly avoid even parsing a file a second time.

Technically, most of those languages do some degree of processing of those imports/includes. Much to my chagrin, Python actually evaluates the entire file, as do Perl, Ruby, PHP and JavaScript. Java, Haskell and C# (and I guess now VB) do their static type checking (often very efficiently, because the type information is cached or at least already byte-compiled), and Lua does a fair bit of processing as well. Smalltalk is image based, so it's kind of a very different context, but it could probably make the best case for "not having to process every time a module is imported/included", because code is loaded into a global symbol table once, and most implementations wait until message dispatch to resolve things further.

In general, I'd say the nature of your concern is more about long compile times, and C++ definitely is "impressive" in that regard. Some of that problem lies within the language's idiosyncrasies (sticking with the legacy C linker being a big one), but a lot of it is simply that it tries to resolve so much at compile time rather than run time. While not always desirable, I think there is definitely a place for a tool like that.


Yes, compile times are my main concern. It is an implementation issue, just like the fact that humans die. It doesn't have to be that way, does it? :-)

The other concern is that you have to first find and then change code in two locations most of the time. On top of that, function signatures in the header are different from those in the implementation file, so you can't just copy them.

This last issue is a language issue and it is utterly ridiculous. There is no sane reason why you're not allowed to keep virtual, static or parameter default values in the code when you copy it into the implementation file.

[Edit] One more thing: If a language is designed in a way that makes it practically impossible to create a fast implementation, that is a language issue in my view. Doesn't mean that such a language has no place. It does and I'm using it.


Two words: Precompiled Headers. Once they're used programs can be designed specifically with them in mind. This gives the compiler more flexibility for code generation and inlining as well, not just for templates.


Haskell is a great example where interface and implementation are decoupled. Typeclasses are used a lot to describe problems with similar patterns. Look at monads, monad transformers, applicatives and functors. I don't think half of them even have default implementations defined.


Your description of Haskell is accurate, but so often I find that encapsulation in Haskell is not all it appears to be. More often than not, you need to dig into the implementation of the components you use.


I find that argument valid for functional programming languages with no static type system, like Lisp. They promised us black-box abstraction without giving us hints about what kind of data a procedure takes.

But with Haskell, the type system usually gives sufficient hints about what is happening within a procedure.


Welcome to templates and inline functions.


Agreed.


And this is why you need to know all the algo and data structure "basics" when interviewing at the big G. Because people there are working on these sorts of optimizations in their spare time.


Agreed. When I was working on Google App Engine, the Quota team sat next to me. The Quota team runs the service that just about every other service at Google calls to ensure a user isn't abusing "The Google". That means they need insanely small response times to ensure that all the other services run on time.

They hit an issue no-one really thinks about: resizing hash tables. No-one thinks about it, as it happens transparently under the surface, but for them, the few milliseconds it took to resize the hash tables would have enormously detrimental effects on just about every other Google service. The solutions they were talking about were really quite amazing. That's the insanity they face.

So, if anyone hasn't seen Google Sparsehash[1], check it out: it's another near drop-in improvement to the STL. Two bits per entry of overhead and insanely fast. I can't remember if the Quota team used it specifically, but I wouldn't be surprised if it's one of the results of their work.

[1]: http://code.google.com/p/sparsehash/


No one thinks about resizing hash tables...except for any high-frequency trading firm worth its salt. And probably several game companies. I'm sure there are others.

It gets weirder when we talk about C++ in particular: did you know that thanks to a standards committee decision, erase by iterator in a typical C++ TR1 hash set or map is O(n)? Erase by key actually has better performance despite the extra lookup. People do think about these things, in a few places, but it can be hard to get ideal performance standardized...even in C++.


Redis has an implementation of an incrementally rehashing hash table in the "dict.c" file. Very simple-to-understand code.

https://github.com/antirez/redis/blob/unstable/src/dict.h

https://github.com/antirez/redis/blob/unstable/src/dict.c

As you can see, no rocket science there.
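
For anyone curious, the core idea (not Redis's actual code, just a toy C++ sketch with made-up names) is to keep two bucket arrays during a resize and migrate one old bucket per operation, so no single call pays for the whole rehash:

    #include <cstddef>
    #include <functional>
    #include <list>
    #include <vector>

    // Toy sketch of incremental rehashing: while growing, both the old and the
    // new bucket arrays exist, and each insert() migrates one old bucket.
    class IncrementalSet {
      using Bucket = std::list<int>;
      std::vector<Bucket> old_, new_;
      std::size_t migrated_ = 0;  // index of the next old bucket to move
      bool rehashing_ = false;
      std::size_t size_ = 0;

      static std::size_t slot(int key, std::size_t n) {
        return std::hash<int>{}(key) % n;
      }

      void step() {  // move a single old bucket into the new table
        if (!rehashing_) return;
        for (int k : old_[migrated_])
          new_[slot(k, new_.size())].push_back(k);
        old_[migrated_].clear();
        if (++migrated_ == old_.size()) {  // rehash finished
          old_.swap(new_);
          new_.clear();
          rehashing_ = false;
          migrated_ = 0;
        }
      }

     public:
      IncrementalSet() : old_(8) {}

      bool contains(int key) const {
        for (int k : old_[slot(key, old_.size())]) if (k == key) return true;
        if (rehashing_)
          for (int k : new_[slot(key, new_.size())]) if (k == key) return true;
        return false;
      }

      void insert(int key) {
        if (contains(key)) return;
        if (!rehashing_ && size_ + 1 > old_.size()) {  // start growing lazily
          new_.assign(old_.size() * 2, Bucket{});
          rehashing_ = true;
        }
        step();  // amortize: a little migration work on every insert
        std::vector<Bucket> &t = rehashing_ ? new_ : old_;
        t[slot(key, t.size())].push_back(key);
        ++size_;
      }
    };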


Really great C code. However, "timeInMilliseconds" should really use clock_gettime(CLOCK_MONOTONIC, ...) instead of gettimeofday(...). A sudden change in time due to NTP, an administrator, timezone changes, etc., could have the loop in "dictRehashMilliseconds" running for longer than expected.

(Here's someone explaining the problem with gettimeofday http://blog.habets.pp.se/2010/09/gettimeofday-should-never-b...)


Good advice, thanks, taking note


Reading this code made my day, thank you.


> did you know that thanks to a standards committee decision, erase by iterator in a typical C++ TR1 hash set or map is O(n)

Weird, never heard of that. What is it about the iterator requirements that causes erase to be O(n)?


It's the fact that erase(iterator) returns the "next" iterator. Since hash tables often don't shrink, they can become quite sparse, and the "next" operation is linear.

This is the best link I have right now--a ticket I filed (and was fixed) for Boost's implementation: https://svn.boost.org/trac/boost/ticket/3693

Note the references to GCC's implementation. As far as I can tell they fixed it in 4.7, but caused some other performance regression which is still open. And most people don't use GCC 4.7 in production environments yet.
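
To make the difference concrete, a hedged sketch (newer library versions may have fixed the regressions mentioned above):

    #include <unordered_map>

    int main() {
      std::unordered_map<int, int> m;
      for (int i = 0; i < 1000; ++i) m[i] = i;

      m.erase(42);        // erase by key: hash the key, unlink the node, done

      auto it = m.find(43);
      if (it != m.end())
        m.erase(it);      // erase by iterator: also has to produce the "next"
                          // iterator, which in a sparse table can mean scanning
                          // many empty buckets
    }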


Reading your comment, I thought I would mention this especially nasty bug in 4.7.1: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54075

Short description: a measured 3x performance regression from 4.6.2 to 4.7.1 when using std::unordered_map

Not sure what the status is now, my Ubuntu and Debian boxes all report gcc 4.7.2.


This is extremely unfortunate - I can see how the C++ committee argued about this issue (i.e. having a consistent "erase" interface across different data structures), but it's bound to backfire for many who are unaware. Especially in a hash table, where you conceptually don't expect erase to provide the successor.


Cost of hash resizing is something that was covered in my second undergrad CS course, and something that has been a recurring subject in C++ books and articles for the last two decades or so. I'm sure there are a lot of people who don't think about it often, but mainly because most people don't need to.

Performance in general is something most people don't need to consider much.

While these Google projects are great, it's also worth considering that a lot of the time, having a near drop-in replacement for an STL container may not be worth it, because the improvements it brings might not be worth the cost of bringing in a non-standard component outside of the type of extreme situations Google faces.


Thinking about the cost of resizing hash tables is standard practice in latency sensitive and/or highly concurrent systems programming.


Actually, sparsehash is not optimized for speed; it reduces the usual hash map memory footprint. In practice it is on par with or slower than STL unordered_map.


I should have specified earlier: Google Sparsehash also contains a dense hash map, which is optimised for speed, not space. It is, however, still more space-efficient than most other hash implementations. Check out dense_hash_map[1]

[1]: http://sparsehash.googlecode.com/svn/trunk/doc/implementatio...


For these problems there are hash tables with progressive resizing.

For the theory you can see: http://en.wikipedia.org/wiki/Linear_hashing

I did one in C at http://tommyds.sourceforge.net/. It's competitive with the Google dense hash, sometimes slower, sometimes faster. There are nice graphs in the benchmark section :)


> They hit an issue no-one really thinks about: resizing hash tables. No-one thinks about it, as it happens transparently under the surface, but for them, the few milliseconds it took to resize the hash tables would have enormously detrimental effects on just about every other Google service.

Hmm.. not sure I'd agree with your assessment that no-one thinks about it. Certainly most DB engines are very aware of this characteristic of hash tables, and it is one of the reasons B-tree style indexes are preferred (not to mention the advantages of sorting).

I think in general, people doing systems development tend to be very aware of these kinds of issues, and perhaps Google is comparatively rare in that it is a company where systems and application development are so vertically integrated. I'm not even sure if that is so terribly unique.


What does knowing this stuff at interview-time have to do with being able to implement it later? I certainly don't remember how b-trees work off the top of my head, but what I do know is how to find out the answer given a computer and an internet connection.


Having the details and characteristics of b-trees (among other things) in memory allows you to consider them as possible solutions to a problem you are facing. If you don't know the details, you would probably not know they are a good fit for a particular problem. The more detailed knowledge you have in your memory, the bigger your possible solution set will be; thus, you will have a better chance at solving a particular problem.

I think Google is not looking for people that can follow specs (although that's valuable too) but people that can come up with brilliant solutions to hard problems.


I would think a person who can specialise, learn and memorise the details and characteristics of a particular domain quickly, say b-trees, when he is working on a relevant problem, would be more valuable than someone who knows about all things computer-science related in the same, lesser, detail.

Wouldn't it be more important to have people to come up with brilliant solutions on problems they are working on, rather than hard problems in general that have already been solved?


Typically, one doesn't know that a problem is in a specific implementation domain (b-trees) unless one knows that domain.

For example, I have M time slots and N people that need to be scheduled. Each of the N people has a time schedule for the M slots (either free or busy). If a person doesn't know about bipartite graph matching (or even graphs at all), then they're less likely to come up with a polynomial-runtime solution. Now, if a person knows about bipartite matching, they might as well learn the algorithms (max flow, etc) to solve it (even for fun, at least I know I would).


As always in our field it is somewhere in the middle - having knowledge of the platform and the field is something that can give you some hints in advance about potential problems with the approach for a difficult problem.

After all, a timely "This is not going to work well because there will be lots of branch mispredictions and you will be going out of the L3 cache a lot, and RAM is terribly slow" can save a lot of time and effort for some classes of tasks that big scalability companies have to deal with often.


>Wouldn't it be more important to have people to come up with brilliant solutions on problems they are working on, rather than hard problems in general that have already been solved?

Most of the time, the guys that can come up with "brilliant solutions" to new problems are those that know the solved problems inside and out.

The other guys just keep reinventing the wheel (only worse).


I averaged somewhere between one and two interviews per week during my 4 years with Google. Google has (or at least had) more applicants than they can shake a stick at. They have the choice between smart people who can look up things on the Internet but aren't interested enough to memorize the basics, and smart people who are interested enough in the subject matter to have the basics memorized (or at least are smart enough to put on a good show of loving the subject matter).

Maybe having the basics memorized isn't necessary for the job, but it's a good signal that you're not a zombie banging out code without really being interested or thinking critically about what you're doing. It's not a perfect signal, but it's still a useful signal.

Everyone knows you need to know your basic data structures for a Google interview. If you don't know your data structures, cram for a week or two before your interview. The basic data structures are just the basic starting point for the data structures interview questions. The interviewer needs to know you know the basic data structures before they start throwing curveballs at you to see how you think about and solve hard data structures problems. The people who think they're too smart or too cool for the basic data structures questions are just going to drown, so if they bomb the basic data structures questions, the interviewer just moves on and pretends that was where the question ended. There's no sense in frustrating the interviewee and making the whole company look mean and forcing the interviewee to answer questions they lack the background knowledge to begin solving.

On top of that, if you can't even be bothered to cram some basic data structures for an interview, you're not going to fit in well at Google.

I also asked some constrained bin packing questions, related to scheduling distributed computations in a cluster and placing files in a distributed filesystem, but I'd simplify the problem a bit and make up some slightly silly story about kids selling candy or something. The simplification and the non-computer back story helps reduce the amount of assumed specialized knowledge, and keeps people from over-thinking the thing, thinking it's a trick question, or otherwise running off into the weeds.

Silly stories really do help people accept the simplified problem and not get lost thinking they're supposed to be thinking outside the box. If you ask a question about networking, people will suggest reducing the number of intermediate switches or maybe even swapping the network media with something with a higher signal propagation velocity. Rephrase it as kids playing Pokemon with cards, and they're more likely to notice that O(N^2) network traffic can be reduced to O(log N).

Some people just thought some problem involving candy or trading cards instead of computers was beneath them and silly, obviously missing that the underlying problem had real implications for computing at massive scales. If they got upset about the problem, I just let it go, since arguing with them or making them feel dumb by pointing out the isomorphism would just rile them up for their next interviewer. I usually genuinely hoped they'd do better with the next interviewer. Good engineers are hard to find, and if I'm the only person they rubbed the wrong way, they're unlikely to be working with me anyway.

If you notice how the simplified silly problem is probably related to a real problem the interviewer has had to solve, it's probably best to make a nonchalant comment. "Oh, so it looks like a constrained bin packing problem, maybe with applications in distributed storage or distributed process scheduling, or something like that. Let's see, you said the Hershey Bars can't be kept near dogs, so let me think about the heuristic of allocating them first. They're each $X and Y Calories, so ..."

If you think an interview question is silly, make sure you hide that opinion well from the interviewer. You're likely missing something, especially if your interviewer appears otherwise bright.


On top of that, if you can't even be bothered to cram some basic data structures for an interview, you're not going to fit in well at Google.

This. If I'm going on an interview I will look up what the company does and hopefully how they tend to interview and study a bit. Of course I know things like FizzBuzz, but if I haven't done it in a few months I'll quickly throw it together in a few languages I have on my resume. It's common sense preparation.

but I'd simplify the problem a bit and make up some slightly silly story about kids selling candy or something.

I have found that generally the people I want to hire love thinking and talking about these sorts of problems. I hate when people ask 'what does this have to do with programming?' Ugh. Programming is problem solving.


>Silly stories really do help people accept the simplified problem and not get lost thinking they're supposed to be thinking outside the box. If you ask a question about networking, people will suggest reducing the number of intermediate switches or maybe even swapping the network media with something with a higher signal propagation velocity. Rephrase it as kids playing Pokemon with cards, and they're more likely to notice that O(N^2) network traffic can be reduced to O(log N).

Interesting! Does this also work for actual problem solving?


Sounds like a fraternity hazing ritual to me. I'm curious, does your target have access to Google.com during your "interview"? Also what is going to happen to your hazing process when people start walking in with Google glass? If your canned intelligence test questions have been recorded and indexed, aren't you going to feel a little silly?

If a company is using f2f interviews to "weed" out candidates they either have a very broken recruitment team or really think so highly of themselves that they are in need of a wakeup call.

For having such a reputation as an innovative company it sure seems odd to rely on an antiquated process that seems straight out of the social eugenics movement of the 1930s.


I worry that I'm feeding a troll here, but...

> I'm curious, does your target have access to Google.com during your "interview"?

No, of course they don't. The idea is to see how well they think without being able to "just google it", so when they're tackling a problem nobody has tackled before, they aren't completely lost.

> Also what is going to happen to your hazing process when people start walking in with Google glass?

Probably they'll get asked to take them off, much the same as you might do to a candidate who walked in using a cellphone.

> If your canned intelligence test questions have been recorded and indexed, aren't you going to feel a little silly?

Candidates are told that the interviews are confidential. And no, I wouldn't feel silly if people recorded the questions and put them on the internet. It would be kind of a warning sign if we were embarrassed about them - the reason they're confidential at the moment is the same reason you can't look up upcoming exam papers on the internet.

> If a company is using f2f interviews to "weed" out candidates...

What on earth is a job interview for if not to reject some candidates? It's kind of a negative way of looking at it, but that is exactly what they're there for; to weed out candidates who are good (they must be at least adequate to get that far) but not good enough at the moment.

I'm not even gonna start on the last paragraph. Goodness knows nobody, especially Google, has a perfect interview process, but comparing it to eugenics is a little hysterical.


Umm, a common practice in the US in the 1930s at state fairs was to take IQ tests in order to become informed about your "feebleness" so as to aid you in your decision and fitness for marriage and procreation. In some jurisdictions, you could not enter into marriage if you were "feeble-minded." [0]

Most Americans don't realize there were several programs of enforced, mandatory sterilizations of the "feeble" as part of the social eugenics craze in the US. Faribault, Minnesota was a pretty active center of such activity.

Anyway, I think the comparison is apt and illustrative. Using "intelligence" measures to quantify someone for something entirely irrelevant to the task being tested for. In the first case "thinking on your feet" questions and professional software engineering and in the second procreating and raising children free from genetic defect. Hopefully with a little reflection you will see the connection too.

[0]: http://en.wikipedia.org/wiki/Eugenics_in_the_United_States


Well, fortunately, Google don't sterilise people who fail the interviews...


Whew. Good to know, I'm glad you cleared that up. I'm assuming they also are not racist because that is part of the analogy carried to illogical extremes too, right?

You mention that the interview process is confidential and thus immune from recording/playback. Is that true? Do candidates sign something? In any case, I think your allusion to academia is interesting ("same reason you can't look up upcoming exams"). Isn't this entirely the wrong model for forecasting future job performance, and the main complaint most people have about the undergraduate university system ("learn" in order to pass the test)? The original comment even mentions failure to "cram" prior to the interview as an indicator of insufficient enthusiasm.

Anyway, I think if a candidate came into an interview and needed Google or Stack Overflow or whatever to function, I would provide that as a resource and use it as an opportunity to judge the dependence and quality of their workflow, because it is not extraordinary to see how everyone (even brilliant geniuses I know!) uses these resources on a daily basis. I don't think I would deny someone access to prescription medications affecting cognitive function/enhancement either, nor could I even do so legally. How can you, on the one hand, use internet contributions (well-reputed blog posts, open source contributions, an active social media following, etc.) as positive evidence of candidate desirability and at the same time view using them as negative?

I guess I would care if, for example, I was screening someone who, as part of their duties, was expected to, say, represent me speaking at a conference, or whose work product was extremely confidential (something which would necessarily require curtailing access on the job).

Finally, I have to imagine that in the very near future, if the current trend of viewing internet access as a universal human right continues and it becomes even more of a basic enabling technology of the human experience, denying someone access during a job interview is going to become a dicey proposition, the same as not providing accommodations for, and not taking into account the cost of, say, wheelchair access is today. The irony of Google denying access to the internet during a job interview is not without some comedic value.


How is this a hazing process? The OP mentioned a number of ways he tries to simplify it or keep people from being intimidated by the technological implications. They are looking for people who can think on their feet.

To be honest your entire post sounds jealous / bitter.


A hazing process is a tribal ritual used to stress candidates and humiliate them, to weed out those who are deemed unfit by some standard that members of the actual tribe usually don't conform to anyway, and to measure the devotion of the candidate to the tribe by submitting to and acquiescing to the authority of the ritual.

I have no idea why you would suggest jealousy. I have no intention of ever working for Google unless I am acquired. Perhaps my reply was poorly worded, but I find this kind of process very unenlightened. If I'm "bitter" it's only because I guess I would expect better from a company that has vacuumed up so much available oxygen. Also, I have no idea whether this is actually the case at Google, I was just responding to someone who represented their experiences as such.

Finally, the last thing you want from an engineer is to be "good on their feet" unless you are hiring for some critical ops position. Rather you want someone that takes measured analytical approach.

This is like choosing a President based on their debate performance or the security provisions of the TSA. It's just theater bordering on the absurd.


> I'm curious, does your target have access to Google.com during your "interview"?

I think it would have been a good idea for Google to publish online an official "Google interview cheat sheet" and hand a copy to the interviewee at the start of the interview.

However, if the candidates know beforehand that the web will be available, they're going to study less; it's just human nature. If they have 20 minutes for a question, and they have to use the web to recall the basis for the question, they're probably going to waste 5 minutes compared to someone who has taken the time to study.

> If your canned intelligence test questions have been recorded and indexed, aren't you going to feel a little silly?

They weren't intelligence questions. Many brilliant people would have failed miserably, and many people with IQs lower than the Google median would have done very well. Many of the questions I asked were simplified versions of problems I saw and fixed in Google's codebase. I wanted to make sure that people who got hired were capable of diagnosing and fixing the real kinds of problems I had seen.

Edit: spelling mistake, added "it's just human nature".


Note that I wrote the OP in response to someone who seemed to think they were better than having to memorize/review the basic data structures. The measures I took to reduce stress weren't relevant to why it was necessary to know how basic data structures work, and I didn't mention them.

As someone else has mentioned, the interview process may have changed significantly since I left in 2010. Clearly I don't think Google is perfect, but they try hard to make the interview process as accurate as possible. It's very expensive in terms of opportunity cost alone to interview someone, and every person they stress out in the interview process is going to tell at least 10 good potential hires about their bad experience.

Generally the first thing I did was tell the interviewee that every interview was a clean slate and that I didn't know how they had done in previous interviews and wouldn't tell later interviewers about their performance. I told them the hiring committee takes the independent reports from the various interviewers and throws out any outliers, so they should try and not worry about any one interview. I mentioned that the sheet of paper I had been passed listed which questions they had been asked, but not how the interviewer thought they had done. I explained that I would gradually modify the questions to make them harder and harder until we ran out of the time I had allotted for that particular question, and that I didn't expect them to be able to answer the most difficult parts of the questions in the allotted time. I mentioned that I was more interested in seeing how they thought than seeing one best answer.

I generally tried to structure the interview as a joint problem solving session, trying to use the word "we" as much as possible. I gave hints when people got stuck, and certainly didn't expect them to fully solve the problems without hints. In my writeup, I'd often note if I felt they would have done a bit better if they had been a bit calmer. I really was silently rooting for them to succeed. It's very fun interviewing someone who's bright and excited about the subject matter.

That being said, I had 40 minutes (trying to leave 5 minutes at the end for any questions they had) to try and assess the interviewee's abilities in several categories. I'm not sure how to do that without ratcheting up the difficulty of a question until the interviewee fails, and then repeating with another question. If you don't push them to the point of failure, you don't know where the edges of their abilities are. I tried to minimize the pressure, but in the end, I had to push them until they failed, and failed multiple times. Failure is naturally stressful, and there was only so much I could do to reduce that stress.

The interview process tried to minimize any biases of any one given interviewer, making each judgement as independent as possible, so that meant there needed to be many interviews. There's a lot of opportunity cost spent in interviewing people, and people get worn out if you keep bringing them back in, so that necessitated making each interview short.

I knew a fair number of full-time Googlers who started out as contractors. I'm sure already working at Google helped take a lot of the pressure off, but they still went through technical interviews.

Anyway, the interview process was far from perfect, I'm just saying that it's perfectly reasonable for professionals to be able to cram freshman/sophomore level data structures in order to interview.


Yes, it sounds like you took your role very seriously and the applicants were probably extremely grateful.

I just think the whole process as you recount it seems inefficient and probably as useful as a fraternity hazing at getting the desired result. The fact that the process is daunting is used to screen out the undesirables, even when completing the gauntlet has very low truth value on the question of performance in the role. But I'm sure that Google has some A/B testing, right? Like measuring the effectiveness on the job of some candidates that didn't go through the process (like, I dunno, low badge numbers) against those that did go through the process. (I had to sneak that in, lol?)

Not a bad thing in an interview, measuring how the candidate responds to failure/blocking as long as it is done honestly because creative solutions and overcoming failure/roadblocks is pretty much the same thing in engineering.

I have to agree to disagree with you on your conclusion. Cramming data structures seems to me like a waste of time.


How would you recommend an interviewer figure out if a candidate has a strong background in data structures and algorithms?


If I had the resources of a Google I would innovate a solution so that the interviewer can focus on the problem that she is actually trying to solve: determining fitness.

Compare this for example to the NFL combine. Or the amount of time and money put into farm systems, feeder teams, and scouting reports. Why not pay the candidates to solve an engineering problem and judge the output? I don't have the answer, but I'm fairly certain that there has to be a better way than stress traps and good-ole-boy networks.

I work very hard to tailor my interview questions to the actual background of each particular applicant so that they can in effect ask their own questions and demonstrate ability (a la "oh, that's interesting, how did you solve that problem, and what if you did this instead, how would that work in the design, here's a whiteboard, show me", etc.). It is very hard, and I regret very much the "good hires" I have probably let go because they couldn't engage well with me or talk about their own experience well. This obviously takes time and effort, so I depend on recruitment/referrals to properly weed out wastes of time.

Is anybody working on this problem? It seems clear Google isn't, and HR innovation is extremely profitable.


What makes you think Google hasn't tried your solution, among others?

Google spends a large amount of time, energy, and money trying to figure out the best ways to hire candidates that are "fit", and tries numerous things (simultaneously, in fact) beyond "asking stress problems". The fact that most people go through the current interview process (which, btw, has visibly changed in the 6.5 years I've been there) does not mean they don't experiment, or use alternative methods. It's simply the most common.


The people adjusting the hiring process are those that passed it. It is a filtering system that picks out a particular type of person and reinforces it. I find it hard to believe that it will change in any way that adjusts the filtering system. People at Google might experiment, but it will be within the orthodoxies, e.g., choosing between hybrid or gasoline vehicles, rather than considering mass transit.


Err, no they aren't, because they aren't all (or even mostly, AFAIK) engineers. It's certainly not the case that engineering and other positions have the same interview process.


I would also qualify that as "Google spends a large amount of time, energy, and money trying to figure out the best ways to hire candidates AT SCALE"

There are literally millions of resumes being submitted every year, and the cost of a false positive is high (saying "yes" to a bad hire is generally much worse than saying NO to a fantastic hire).


You are correct, I don't know and was perhaps overgeneralizing based on the belief that the parent was accurately representing his/her experience as common.

What can you tell me about how the process in your experience has changed in the last six years? Experiments, even social ones, have little value unless their results are shared! ;-)

I'm always looking to improve my HR skills. Hiring is so hard. It's my opinion that one of the reasons it is so hard is (primarily?) because of all the subterfuge/deception involved by both parties of the transaction. I wonder whether the experience of successful dating services have any insights to offer? Seems like a similar problem but without the messiness of physical attraction to get in the way.


>but what I do know is how to find out the answer given a computer and an internet connection.

Then you are not good enough for them. They want someone who breathes and knows these things inside out and backwards.

Any half-competent dabbler can look them up and implement them. Creating novel solutions takes more deep understanding of them than that.


The difference being that novel solutions require actual domain knowledge rather than an n-levels-deep memorization of the entirety of your Algorithms 1 and 2 courses from university.


It's not either/or, you need to understand, not (just) memorize, the algorithms enough to see where they could be applied in or adapted to the domain in question.

Similarly, I could quip that nontrivial novel solutions require "actual" algorithms knowledge rather than exhaustive domain knowledge. Not surprisingly, in reality it's a bit of both.


There is this Shakespearean scholar who doesn't remember any of the works of Shakespeare. You ask him to recite a sonnet, and his response is that they are written down, so he doesn't have to memorize them. He doesn't even own a print copy: Google, after all, right? He is trying to come up with a new understanding of the works of Shakespeare.

You have to have the assumptions down cold before you can challenge them. You have to have the basics down cold before you can reshape them. Having the fundamentals down so cold that you can apply and work with them is the beginning of the path not the end.


Ok, you make a certain point, and I'm going to correct myself.

Google does not interview based on the material from Algorithms 1 & 2 in university. Merely having the fundamentals taught in undergrad down-pat will not actually help much on a Google interview.

Google interviews based on a database of Interview Questions, which are largely unique to the BigCo recruiting process. They're not real problems I've ever seen anyone actually have to solve; they're basically checks to see if you have studied the subject of BigCo coding interviews in itself.

To repeat my example from below: how do you find all the identical anagrams of words in a text, given the text? This is not a problem taught in Algorithms class, and in fact, there are several different ways to apply Algorithms 1&2 knowledge to the problem. Only one of those (tokenize and sort the tokens in lexical ordering to get a "canonical" form of the string, then build your own key-value map from canonical strings to bags of words) will get you the job, and in fact it's the one that involves pretending you can't build sets of characters and use them as keys to a hash table. In fact, if you could hash character-sets, it's much cheaper to do so rather than employ a counting sort followed by a key-value store.

Actually, as I remember, we didn't (and don't) do much with key-value data structures in Algorithms class at all, not as I took it, and not as I've TA'd it. The question is used precisely because it tests your knowledge of How to Play the Interview Game rather than your fundamental computing knowledge.
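
For the record, a minimal sketch of the sorted-canonical-form answer described above (assuming lowercase words and whitespace tokenization):

    #include <algorithm>
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    int main() {
      std::string text = "listen silent stone notes onset";
      std::map<std::string, std::vector<std::string>> groups;

      std::istringstream in(text);
      for (std::string word; in >> word; ) {
        std::string key = word;
        std::sort(key.begin(), key.end());  // canonical form: letters sorted
        groups[key].push_back(word);        // words sharing a key are anagrams
      }

      for (const auto &g : groups)
        if (g.second.size() > 1) {
          for (const auto &w : g.second) std::cout << w << ' ';
          std::cout << '\n';
        }
    }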


Err, plenty of answers would get you the job besides sorting, including order insensitive hash functions and tables, neat automata based solutions, etc.

FWIW: I've hired people who even never got an "optimal" answer, but had brilliant ways of thinking about it that didn't turn out to be better/faster.

In any case, they deliberately ask questions that require applying knowledge, and designing an algorithm, rather than memorizing the way to reverse a circular linked list or whatever. So they are questions designed to test whether you can solve problems you haven't seen before by applying basic computer science knowledge.

I get/can afford to spend 45 minutes with a candidate. If you've got a better way to understand whether a candidate can apply computer science knowledge, I'd love to hear it.

Most problems being worked on for real would take at least 20-30 minutes to introduce (probably a lot longer to explain constraints, etc.). Nobody is going to have interesting insights in fifteen minutes, and engineers are too busy to spend hours per candidate.

If you can't see how the anagrams question tests basic data structure and algorithms knowledge, I don't know what to tell you.


Err, plenty of answers would get you the job besides sorting, including order insensitive hash functions and tables, neat automata based solutions, etc.

I actually gave hashing/hash-table of a small character-bag of my own making as the first answer. The interviewer then told me to solve the problem "without any [standard-library] data-structures". I had to question that, and was told I had to come up with everything myself, as if writing my own standard library in C. The interviewer was then satisfied when I gave the sorting answer, and I did get the positive call-back.

It was a matter of giving him what he wanted, not necessarily just solving the problem.


Your interviewer was an idiot. Again, if this was at google, please let me know privately who it was, and i'll take care of it.


You seem to make the assumption that an interview question should be representative of the work you would do on the job. You could also see such questions as measuring a variable of interest to the interviewers; in this case that would be creative problem solving. The fact that it's neither something you did in algorithms class nor something you will do on the job is on purpose, it forces you to come up with something new.


A Shakespearean scholar who doesn't remember any of the works of Shakespeare? Does he have some kind of memory problem? He's spent years of his life examining the works of Shakespeare in detail, reading around the subject, probably watching play after play, travelling to see new interpretations being put on, and he can't quote me a single sonnet? Either he's incompetent, or he has some kind of memory problem.

Now extend to algorithms. I wouldn't hire a programmer like this Shakespearean scholar.


The analogy would be perfect if "working programmer" and "algorithms researcher" were the same thing.


The domain knowledge when it comes to implementing algorithms is algorithms themselves...


Yes, but the algorithms relevant to a domain are not Algorithms 1&2 class algorithms. For example, writing a network protocol requires knowledge of finite-state machines, writing a type-inference engine requires knowledge of unification, writing a thread scheduler requires knowledge of various scheduling algorithms.

Knowing the most efficient way to find all the anagrams in a text is not relevant to any of them, though. And yes, I was given that problem in an interview last week. I got it right, but the actual position involved more knowledge of architecture, assembly, and compilation than of Algorithms Guru Yoga.


Well, I hope you weren't asked that at Google, since it's a banned question. If you were, let me know privately :)

I assume you understand the goal of these questions is not to figure out if you can perform memorization of algorithms 1 + 2, but to understand how you think about problems and watch you do it?

Note that, at least at Google, they still try to hire generalists (in most cases), so you would still get that "kind" of question there, even for a job involving architecture and assembly. This is a deliberate choice.

Of course, I'm sure you also realize that a lot of the algorithms in "architecture, assembly, and compilation" were developed by people who started out more as "Algorithms Guru Yoga" folks.

Gregory Chaitin, who developed graph coloring for register allocation, was definitely not a compiler guy.

Only half the folks on the original conference paper for Static Single Assignment form for compilers were compiler people actually trying to solve that problem.

Robert Tarjan was not a compiler guy, but his scheme for computing dominators is still the main one used today, and his union-find algorithms (he also proved the upper bound on the already-known versions) are what are used for a lot of type unification algorithms.

The list goes on of "Algorithms Guru Yoga" folks who came up with the premiere solutions to problems in the field you are talking about.

So when you say "novel solutions require actual domain knowledge rather than an n-levels-deep memorization of the entirety of your Algorithms 1 and 2 courses from university.", it's a bit hard to take that seriously.

Novel solutions require both domain-level knowledge and the understanding you learned in Algorithms 1 and 2. Even the question you got about anagrams does not require any n-level rote memorization. It requires only the most basic understanding of data structures, and an understanding of how to design algorithms.


Why was it banned?


Because this one is very widely known (it shows up in books on interviewing at Google, etc.), it's no longer useful for evaluating candidates.


At the very least, I can assure you it wasn't Google asking that question. It was, however, a well-known BigCo who employ quite a lot of software engineers and are generally known as the best at what they do.

Now, in answer to the rest of your post: you are broadly overestimating the curriculum of Algorithms 1 & 2. That is, Algorithms 1 & 2 + knowing you need to allocate registers is not going to necessarily lead you to graph coloring.

I assume you understand the goal of these questions is not to figure out if you can perform memorization of algorithms 1 + 2, but to understand how you think about problems and watch you do it?

I have heard this explanation; I simply don't see the evidence as in its favor. Basically, if you were looking for real problem-solving skill, you would have to entirely abandon the 45-minute interview format, whether it's on the phone or in-person.

That is, real problem-solving, as I've seen it and done it, tends to involve some kind of real problem domain (so you know what sort of solution is desirable), a real resource allocation (so you know what performance trade-offs to make), a less "academic exam" set-up (so you can experiment with solutions and see how they look), and more time to think than 20-35 minutes.

Interview environments are: relatively domainless, have no set resource allocation constraints other than "do well at Big-O performance measures", come as "pass/fail tests" rather than collaborative solving efforts, and are given with heavy time-pressure.

"My normal methods are useless here" -- those being to sketch out many solutions in a notebook over a period of hours or even days. Worse: the effect is almost to punish people who've had the audacity to spend significant portions of our careers in specific domains, from web-dev to type theory.

Also worse: interviewers tend to be unsatisfied until given their favored solution to their chosen question. In the example cited above, my interviewer was unhappy with my choice of a hash-table for a key-value store and actually asked me to go back and redo the problem "without using preexisting data structures". He ended up happy (I was called to do a second interview) when I told him we could make an insertion-sorted linked list of key-value mappings.

Why didn't I think of that in the first place? Because an insertion-sorted linked list performs worse than a hash-table as a key-value mapping data structure. A hash-table is, as far as I know, writing calmly and without pressure, the go-to key-value map for real programming. But most interviewers I've met are somewhat disappointed at seeing a hash-table, because using low-level data structures from the standard library is what stupid people do. Of course, it's also what experienced working programmers do.

Again, this one wasn't at Google, but it's a fairly good example of how BigCo interviewers come off as trying to play a round of Who's Mister Clever? more than actually engage in a problem-solving process. Other favored tactics include dinging you for what programming language you use, or questioning the syntax of your white-boarded code (had this happen to me in the same interview, turned out I was right when we asked one of the interviewer's colleagues).


Or maybe it's self-selection bias.


I don't understand what this is supposed to add to the observation that these subjects are important at Google interviews. It seems tautological to refer to interview requirements as "selection bias".


What I meant is that maybe the people working at Google were interested in algorithms even before working at Google and therefore they went to work for Google and therefore they work on algo in their free time. I did not say selection bias, I said self-selection bias http://en.wikipedia.org/wiki/Self-selection_bias


I would recommend having a look at sorted arrays first. There is an implementation available in Boost:

http://www.boost.org/doc/libs/1_52_0/doc/html/container/non_...
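
For anyone who hasn't used them: the Boost flat containers keep a sorted vector behind a std::map/std::set-like interface. A minimal usage sketch:

    #include <boost/container/flat_map.hpp>
    #include <iostream>

    int main() {
      boost::container::flat_map<int, const char *> m;  // sorted-vector storage
      m[3] = "three";
      m[1] = "one";
      m[2] = "two";

      // Same interface as std::map: ordered iteration, O(log n) lookup,
      // but contiguous memory, so lookups are cache friendly. Insertions
      // in the middle are O(n) because elements have to be shifted.
      for (const auto &kv : m)
        std::cout << kv.first << " -> " << kv.second << '\n';
    }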


Sorted arrays are a common choice that I've seen people go to in lieu of std::set, but a proper B-tree implementation kicks butt compared to a sorted array when mutating data, and an unsorted array actually tends to kick butt for small array sizes (<100 elements).


Anyone know why the STL doesn't use B-trees? Are there use cases where red-black trees are better?


The STL guarantees both iterator and address stability for the elements of a set, map, or the unordered variants. Those guarantees can't be provided by many more compact representations.

Note that this rules out more than just B-trees: open addressed hash tables and judy array like structures have the same issue.


What do you mean by "iterator and address stability"?


Iterator stability means iterators that point into a container are not invalidated by operations (such as insertions or deletions) on that container.

Address stability means that addresses of elements are not changed by container operations. If you think about it, you can see that maintaining this property rules out contiguous storage of any subset of a container's elements, hence my comment about "compact representations".
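
A small illustration of the guarantee (a sketch, not the standard's wording):

    #include <cassert>
    #include <map>
    #include <vector>

    int main() {
      std::map<int, int> m{{1, 10}, {2, 20}};
      int *p = &m[1];
      for (int i = 3; i < 1000; ++i) m[i] = i;  // lots of insertions
      assert(p == &m[1]);                       // node-based map: address and
                                                // iterators stay valid

      std::vector<int> v{10, 20};
      int *q = &v[0];
      v.push_back(30);  // may reallocate; q (and any iterators) may now dangle
      (void)q;
    }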


It says in the article that B-trees are advantageous when the cost of cache misses would be worse than the cost of comparisons, e.g. in the case of integers. In comparison, a red-black tree would do fewer comparisons, and a hash table is faster on average but is unordered (more differences?).


In other words, benchmark heavily. And unless you have trees of integers, don't expect the same results.


Without considering cache misses, a red-black tree is much faster than a B-tree. So this B-tree implementation must be tailored to the cache-line size, and it may perform badly on some CPUs.

The performance gain of the B-tree container also depends largely on the size of the values, so choose this one carefully.


> So this B-tree implementation must be tailored to the cache-line size, and it may perform badly on some CPUs.

Um, no, B-trees are always going to win here as long as the key size is relatively small compared to the number of items in the structure. B-trees, for small keyed associations, have a much better "branching factor per cache line." This means that fewer cache misses are needed to find relevant items (insert, delete, read, …), because the trees are shallower, and the cost of a cache miss on a B-tree node isn't much more expensive than on an R-B node.
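
For a rough back-of-the-envelope illustration (assuming 64-byte cache lines and 8-byte keys): a red-black tree touches roughly one cache line per node it inspects, so a million-entry tree costs about log2(10^6) ≈ 20 misses per lookup, while a B-tree whose nodes hold, say, 16 keys needs only about log16(10^6) ≈ 5 node visits.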

Benchmark everything and tune your B-Tree container size appropriately, definitely. You could even instantiate the template with multiple different sizes at compile time and benchmark them at run-time, then elect the most efficient implementation for the hardware currently being used.


I don't really think it is possible to say one is better than another without a specific application in mind. They probably just chose whichever one they wanted at the time.


You can, you can talk about which operations need to be fast, as shown by the big-O complexity. You can also talk about constant factors, the cost of specific operations such as comparisons, and locality of reference. In my opinion the latter aspects should be emphasized more because you sometimes come across people who built their careers around slightly more optimal datastructures in terms of big-O complexity, which are nevertheless impractical in any but the most extreme situations due to constant factors (which may only get a footnote mention).


You don't seem to be disagreeing. The point was that you need to know which properties matter to your application before you can truly decide where you should worry about emphasis.

But, for those that want a good example of what you are talking about, it seems to mirror the recent post by Carmack regarding ray tracing. If I recall, the point was that, though ray tracing is better in O terms, it has an outrageous constant factor.


I was disagreeing with the notion that you can only say which algorithm to use if you know the specific application. There are general trade-offs between algorithms which hold independent of any specific application.


But the thought is that the specific application will determine which trade-offs are acceptable and/or applicable. So... I'm still not sure how you are disagreeing.

Now, you can list out the various tradeoffs easily enough. And some algorithms are superior to others, I would imagine. However, which to use and the impact it will have on an application depend more on the application than the datastructure.

Or, am I completely off my rocker?


Another wrinkle is that the constant factors can be data-dependent: which data structure to choose can heavily depend on the distribution of data you typically see.


I'm confused. From my understanding, B-trees are good when you load data from a hard drive: they minimize the number of times you have to load a range of data from the disk. Since the C++ B-tree is in RAM, where data loading should matter less, how does it manage such good benchmarks? What is the secret ingredient?


If I had to guess, I'd say better cache locality. It's tremendously more expensive to go to main memory for something than to go to the cache.


One question: Is this something similar to the CSB+ Tree (Cache Conscious B+ Tree)

http://people.cs.aau.dk/~simas/dat5_07/papers/p475-rao.pdf

Or are the optimisations mentioned there still valid?



