Simple bucket-ized stats in awk

Someone recently asked how to take a bunch of numbers from STDIN and then break them down into distribution buckets. This is simple enough that it should be do-able in awk.

Here's a simple one-liner that generates 100 random numbers, bucketizes them to the nearest multiple of 10, and prints the buckets sorted by the number of items in each:

while true ; do echo $(( 1 + RANDOM % 100 )) ; done | head -100 | awk '{ bucket = int(($1 + 5) / 10) * 10 ; arr[bucket]++} END { for (i in arr) {print i, arr[i] }}' | sort -k2n,2 -k1n,1
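
To see what the rounding formula does, here is a minimal sketch that feeds a few fixed values through the same expression instead of random ones. The function name bucket_demo and the sample values are only for illustration; they are not part of the original one-liner.

```shell
# Print each input next to the bucket it lands in.
# Uses the same formula as the one-liner: int((x + 5) / 10) * 10
bucket_demo() {
  printf '%s\n' 4 5 14 15 96 100 |
    awk '{ print $1, "->", int(($1 + 5) / 10) * 10 }'
}

bucket_demo
# 4 -> 0
# 5 -> 10
# 14 -> 10
# 15 -> 20
# 96 -> 100
# 100 -> 100
```

Adding 5 before the integer division is what turns "round down" into "round to nearest": anything ending in 5 or higher is pushed into the next bucket.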

Many people don't know that in bash, a single-quoted string can span multiple lines. This makes it very easy to put a little bit of awk right in the middle of your script, eliminating the need for a second file that contains the awk code itself. Since you can put newlines anywhere, you can make it very readable:

#!/bin/bash

while true ; do
  echo $(( 1 + RANDOM % 100 ))
done | head -100 | \
  awk '
      {
        bucket = int(($1 + 5) / 10) * 10 ;
        arr[bucket]++
      }
      END {
        for (i in arr) {
          print i, arr[i]
        }
      }
' | sort -k2n,2 -k1n,1

If you want to sort by the buckets, change the sort to sort -k1n,1 -k2n,2
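
The difference between the two sort orders is easier to see with fixed input. This is a rough sketch; the helper names sample and count_buckets and the input values are made up for the demonstration.

```shell
# Fixed input so the result is repeatable.
sample() { printf '%s\n' 3 12 14 27 29 31 98; }

# Same counting logic as the script above, without the sort.
count_buckets() {
  awk '{ arr[int(($1 + 5) / 10) * 10]++ }
       END { for (i in arr) print i, arr[i] }'
}

# Sorted by count, then by bucket:
sample | count_buckets | sort -k2n,2 -k1n,1
# 0 1
# 100 1
# 10 2
# 30 3

# Sorted by bucket, then by count:
sample | count_buckets | sort -k1n,1 -k2n,2
# 0 1
# 10 2
# 30 3
# 100 1
```

The sort is needed either way, because awk's "for (i in arr)" does not promise any particular order.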

If you want to be a little more fancy, move the bucket calculation into its own function. What? awk can do functions? Sure it can. You can also pass values in from the shell using the -v flag, which sets an awk variable before the program starts.

#!/bin/bash

# Bucketize stdin to nearest multiple of argv[1], or 10 if no args given.
# "nearest" means (for the default of 10) 0..4.999 -> 0, 5..14.999 -> 10, etc.

# Usage:
# while true ; do echo $(( 1 + RANDOM % 100 )) ; done | head -99 | bucket.sh 8

awk -v multiple="${1:-10}" '

function bucketize(a) {
  # Round to the nearest multiple of "multiple"
  #  (nearest... i.e. may round up or down)
  return int((a + (multiple/2)) / multiple) * multiple;
}

# All lines get bucketized.
{ arr[bucketize($1)]++ }

# When done, output the array.
END {
  for (i in arr) {
    print i, arr[i]
  }
}
' | sort -k2n,2 -k1n,1
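
As a quick check of how -v and the default value behave, here is a sketch that wraps the same awk program in a shell function and runs it on fixed input. The function name bucket_by and the sample numbers are assumptions for the example, and the output is sorted by bucket to make it easy to read.

```shell
# Same awk program as bucket.sh, wrapped in a function.
# $1 is the bucket size; it defaults to 10 when omitted.
bucket_by() {
  awk -v multiple="${1:-10}" '
    function bucketize(a) {
      return int((a + (multiple / 2)) / multiple) * multiple
    }
    { arr[bucketize($1)]++ }
    END { for (i in arr) print i, arr[i] }
  ' | sort -k1n,1
}

# Buckets of 8:
printf '%s\n' 3 4 11 12 20 | bucket_by 8
# 0 1
# 8 2
# 16 1
# 24 1

# No argument, so buckets of 10:
printf '%s\n' 3 4 11 12 20 | bucket_by
# 0 2
# 10 2
# 20 1
```

With a multiple of 8, the halfway point is 4, so 3 rounds down to 0 while 4 rounds up to 8.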

I generally use Python for scripting, but for something this short, awk makes sense. Sadly, using awk has become a lost art.

Posted by Tom Limoncelli

3 Comments

DMonTech | August 15, 2014 12:09 PM

Not that I disagree, but why has awk become a lost art? Is it too difficult, or has the latest generation come to System Administration out of the need for a platform for their code?

I see ruby / python competing for the development time of the modern SA. But why? Is it because they are the language of the tools being used or is it something else?

Tom Limoncelli replied to comment from DMonTech | August 15, 2014 2:54 PM

I think it is simply that new sysadmins don't learn awk. Perl and Python are more popular. Therefore, only people that have been sysadmins longer know awk. It is a shame, since it is very handy.

https://www.google.com/accounts/o8/id?id=AItOawmtqMWgOgRzkv1hw6LE9kPlyNMkamOyfyw | August 15, 2014 7:17 PM

Nowadays I tend to use AWK just for the smallest of things, like no-line-breaking one-liners. The problem you posed can be solved in Perl (and I bet in Python and Ruby too) with less code and more clarity:

#!/usr/bin/env perl
use integer;
for (1 .. 100) {
    $arr{((int(rand(100)) + 6) / 10) * 10}++;
}
foreach my $i (sort {$arr{$a} <=> $arr{$b} || $a <=> $b} keys %arr) {
    print "$i $arr{$i}\n";
}

(Of course, if it were a bigger script I would "use strict", pre-declare the hash, and make a few other embellishments.)
