
Simple bucket-ized stats in awk

Someone recently asked how to take a bunch of numbers from STDIN and then break them down into distribution buckets. This is simple enough that it should be do-able in awk.

Here's a simple one-liner that generates 100 random numbers, buckets each one to the nearest multiple of 10, and prints the buckets sorted by the number of items in each:

while true ; do echo $[ 1 + $[ RANDOM % 100 ]] ; done | head -100 | awk '{ bucket = int(($1 + 5) / 10) * 10 ; arr[bucket]++} END { for (i in arr) {print i, arr[i] }}' | sort -k2n,2 -k1n,1
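To see the rounding at work, it may help to feed the same awk program a fixed list of numbers instead of $RANDOM. This is only a sketch: the function name bucket10 is made up for illustration and is not part of the original one-liner.

```shell
# Hypothetical helper: bucket each number on stdin to the nearest multiple of 10,
# then print "bucket count" lines sorted by count, then by bucket.
bucket10() {
  awk '{ bucket = int(($1 + 5) / 10) * 10 ; arr[bucket]++ }
       END { for (i in arr) { print i, arr[i] } }' | sort -k2n,2 -k1n,1
}

# Fixed input so the result is repeatable:
#   3 -> 0, 5 -> 10, 14 -> 10, 15 -> 20, 96 -> 100, 100 -> 100
printf '%s\n' 3 5 14 15 96 100 | bucket10
# prints:
#   0 1
#   20 1
#   10 2
#   100 2
```

Adding 5 before the integer division is what turns truncation into rounding to the nearest multiple. The order of "for (i in arr)" is unspecified in awk, which is why the output is piped through sort.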

Many people don't know that in bash, a single-quoted string can span multiple lines. This makes it very easy to put a little bit of awk right in the middle of your code, eliminating the need for a second file that contains the awk code itself. Since you can put newlines anywhere, you can make it very readable:


while true ; do
  echo $[ 1 + $[ RANDOM % 100 ]]
done | head -100 | \
  awk '
      {
        bucket = int(($1 + 5) / 10) * 10 ;
        arr[bucket]++
      }
      END {
        for (i in arr) {
          print i, arr[i]
        }
      }
' | sort -k2n,2 -k1n,1

If you want to sort by the buckets, change the sort to sort -k1n,1 -k2n,2
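The difference between the two sort orders is easiest to see on a few sample "bucket count" lines. The counts function here is a made-up stand-in for the output of the awk step, not something from the original script.

```shell
# Hypothetical sample of "bucket count" lines, in arbitrary order,
# as awk's "for (i in arr)" loop might emit them.
counts() { printf '%s\n' '100 2' '0 1' '20 1' '10 2' ; }

echo "by count, then bucket:"
counts | sort -k2n,2 -k1n,1

echo "by bucket, then count:"
counts | sort -k1n,1 -k2n,2
```

In both cases the n suffix asks for a numeric comparison, so bucket 100 sorts after bucket 20 rather than before it.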

If you want to be a little more fancy, move the bucket calculation into its own function. What? awk can do functions? Sure it can. You can also pass values in from the shell using the -v flag, which sets an awk variable before the program runs.


# Bucketize stdin to nearest multiple of argv[1], or 10 if no args given.
# "nearest" means 0..4.999 -> 0, 5..14.999 -> 10, etc.

# Usage:
# while true ; do echo $[ 1 + $[ RANDOM % 100 ]]; done | head -99 | 8

awk -v multiple="${1:-10}" '

function bucketize(a) {
  # Round to the nearest multiple of "multiple"
  #  (nearest... i.e. may round up or down)
  return int((a + (multiple/2)) / multiple) * multiple;
}

# All lines get bucketized.
{ arr[bucketize($1)]++ }

# When done, output the array.
END {
  for (i in arr) {
    print i, arr[i]
  }
}
' | sort -k2n,2 -k1n,1
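As a quick check of the -v mechanism, the same program can be wrapped in a shell function and run with a multiple other than 10. This is a sketch only: wrapping it in a shell function named bucketize and sorting by bucket are assumptions made for the example, not part of the script above.

```shell
# Hypothetical wrapper: the first argument is the multiple (default 10).
bucketize() {
  awk -v multiple="${1:-10}" '
    function bucketize(a) {
      # Round to the nearest multiple of "multiple".
      return int((a + (multiple/2)) / multiple) * multiple
    }
    { arr[bucketize($1)]++ }
    END { for (i in arr) { print i, arr[i] } }
  ' | sort -k1n,1
}

# With multiple=25: 10 -> 0, 12 -> 0, 13 -> 25, 40 -> 50
printf '%s\n' 10 12 13 40 | bucketize 25
# prints:
#   0 2
#   25 1
#   50 1
```

The shell expands "${1:-10}" before awk starts, so inside the awk program "multiple" is an ordinary variable and the function never has to look at the shell's arguments.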

I generally use Python for scripting, but for something this short, awk makes sense. Sadly, using awk has become a lost art.

Posted by Tom Limoncelli


Not that I disagree, but why has awk become a lost art? Is it too difficult, or has the latest generation come to system administration out of the need for a platform for their code?

I see ruby / python competing for the development time of the modern SA. But why? Is it because they are the language of the tools being used or is it something else?

I think it is simply that new sysadmins don't learn awk. Perl and Python are more popular. Therefore, only people that have been sysadmins longer know awk. It is a shame, since it is very handy.

Nowadays I tend to use AWK just for the smallest of things, like no-line-breaking one-liners. The problem you posed can be solved in Perl (and I bet in Python and Ruby too) with less code and more clarity:

#!/usr/bin/env perl
use integer;
for (1 .. 100) {
    $arr{((int(rand(100)) + 6) / 10) * 10}++;
}
foreach my $i (sort {$arr{$a} <=> $arr{$b} || $a <=> $b} keys %arr) {
    print "$i $arr{$i}\n";
}

(Of course, if it were a bigger script I would "use strict", pre-declare the hash, and make a few other embellishments.)