We forget how big "big" is

Talk with any data-scientist and they'll rant about how they hate the phrase "big data". Odds are they'll mention a story like the following:

My employer came to me and said we want to do some 'big data' work, so we're hiring a consultant to build a Hadoop cluster. So I asked, "How much data do you have?" and he replied, "Oh, we never really measured. But it's big. Really big! BIIIIG!!

Of course I did some back of the envelope calculations and replied, "You don't need Hadoop. We can fit that in RAM if you buy a big enough Dell." he didn't believe me. So I went to dell.com and showed him a server that could support twice that amount, for less than the cost of the consultant.

We also don't seem to appreciate just how fast computers have gotten.

Someone recently asked me about the typical interview question: What is the fastest algorithm to find the largest number in an unsorted array?

Well, duh, if you have to touch every item in the array (because it is unsorted) then the best you can do is O(n).

But wait... there is a difference between how I'd answer this on a CS quiz and during a job interview.

In a job interview I never answer a question without first asking at least one clarifying question. A good one for this situation is, "How many items in the array?"

If the amount is fewer than a few thousand, the answer is: IT DOESN'T MATTER. CPUs can do a linear search through that much RAM so quickly that you could comb through it one bit at a time so quickly that the comb would catch on fire and you'd be left asking yourself, "OMG! WHY DID I USE A FLAMMABLE PLASTIC COMB!?!?"

If it was millions, you could assign a fraction of the job to each core and, on an 18-core CPU, you'd be amazed at how fast your comb would melt.

However the real question is... why are you doing this at all?

There is probably an earlier step where you iterate through the array and could collect the largest element as part of that loop. Since the major time consumer is looping, not the comparison, this makes finding the largest value nearly free. If you code it right, the extra if current > biggest then biggest = current instructions will execute during the time when the CPU is paused waiting for RAM.

Yes, this is just an interview question to see if the candidate understands O() notation/analysis. However there are much better questions that will determine that. For example, have them write code that performs some kind of string search and then ask what O() their algorithm is. If you don't ask candidates to write code during your interview, you aren't doing a good interview.

An operational perspective makes the question even more moot. I once saw a process with 5 steps. Each step was done by different team. Each team began by sorting the data. If the first team sorted the data and retained the result for future teams, the time spent sorting would have been improved 5x! The problem, of course, is that this would have required... talking to people. Oh the horror! Please don't make me be a human with social skills! I want to be a coder that lives in total isolation!

No you don't. You want to be a human that lives in a community that develops awesome processes that are driven by software.

So if this question was a homework assignment... the answer is O(N): do a linear search through the list and examine every single item.

If this question is an honest request about a real situation... Put down the keyboard. Get off your ass and walk down the hall. Talk to all the people that are involved and figure out what real need is and re-examine whether the "largest value" is actually needed, how it is needed, and whether the entire process can be improved by looking at it end-to-end.

Get out of your silo. Talk to people. You'll always get better results.

Posted by Tom Limoncelli in Rants

No TrackBacks

TrackBack URL: http://everythingsysadmin.com/cgi-bin/mt-tb.cgi/1968

Leave a comment

 
  • Don't Miss Out - Register Today