Saturday, December 1, 2012

Shell: Summing up lots of (large) numbers

Sometimes you want to know the exact number of bytes that all the files in a directory tree take up. For example, comparing the sums of file sizes is a quick way to check whether a copy operation went OK - if the totals match, there is reason to believe it did.

du gives varying numbers from filesystem to filesystem

However, 'du' doesn't always give the same number from filesystem to filesystem - perhaps because the directory nodes themselves are counted, and their sizes differ?
du -sb .
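To make the difference concrete, here is a minimal sketch (using a throwaway mktemp directory, not anything from the original setup): du's byte total includes the directory node itself, while a find-based listing counts file bytes only.

```shell
# Scratch directory (hypothetical, via mktemp) holding one 5-byte file.
dir=$(mktemp -d)
printf 'hello' > "$dir/a.txt"

du -sb "$dir"                        # apparent size incl. the directory node
find "$dir" -type f -printf "%s\n"   # prints 5 - the file's bytes only

rm -r "$dir"
```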

awk-solutions have major precision problems

The solutions flying across the intertubes using awk go awry with large numbers. The following has a ceiling of 2147483647, the maximum of a signed 32-bit integer. Absurdly, awk just prints that value once the sum passes the limit.
find . -type f -printf "%s \n" | awk '{sum += $1 } END {printf "%i bytes\n",sum }'
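Whether the ceiling actually bites depends on the awk implementation - mawk clamps at 2147483647, while recent gawk can print wider integers - so a quick probe (with a made-up value just past the limit) is the safest check:

```shell
# Probe your awk's %i behavior with a value just past 2^31-1.
# mawk prints 2147483647; some other awks print the number in full.
echo 3000000000 | awk '{printf "%i\n", $1}'
```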

You can get around that by switching to floats (awk actually uses double-precision floating point for all its numbers), but then you lose the whole point: exactness.
find . -type f -printf "%s \n" | awk '{sum += $1 } END { printf "%.1f bytes\n",sum }'
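The loss of exactness is easy to demonstrate: an IEEE double represents integers exactly only up to 2^53, so the very next odd integer silently rounds down.

```shell
# 2^53 = 9007199254740992 is the last point where doubles are gap-free;
# 9007199254740993 cannot be represented and rounds down to 2^53.
awk 'BEGIN { printf "%.0f\n", 9007199254740993 }'   # prints 9007199254740992
```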

The Solution: bc - arbitrary precision calculator language!

Finding the sum of the sizes of all files in a directory tree:
echo `find . -type f -printf "%s+"`0 | bc

Same in GiB (gibibytes, i.e. 1024^3 bytes), using an absurd scale to get exactness ('scale' is bc's concept of decimal precision):
echo scale=50\; \(`find . -type f -printf "%s+"`'0)/(1024^3)' | bc

If you want MiB or KiB instead, change the ^3 to ^2 or ^1, respectively.