It’s not too hard to combine the sort command with find’s formatting options to locate the largest files in a directory hierarchy. As is so often the case though, one size does not fit all.
From the Linux shell, finding all the files under a directory hierarchy is pretty straightforward:-
$ find /usr -type f -print | more
/usr/lib64/librpcsvc.a
/usr/lib64/libeggdbus-1.so.0.0.0
/usr/lib64/libFestival.so.1.96.0
…
And the find command supports all sorts of helpful options to narrow down our search – for example, to find files modified within the last few days, owned by a particular user, or with a name matching a specified pattern:-
$ find /usr -type f -name "*.txt" -print
/usr/lib64/xorg/protocol.txt
/usr/lib64/python2.6/site-packages/PyXML-0.8.4-py2.6.egg-info/SOURCES.txt
/usr/lib64/python2.6/site-packages/PyXML-0.8.4-py2.6.egg-info/top_level.txt
…
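To make the other two filters concrete, here's a rough sketch that wraps them in a couple of small shell functions. The function names are made up for illustration; -mtime and -user are standard find primaries.

```shell
# Hypothetical helper names, for illustration only.

# Files under directory $1 modified less than $2 days ago.
recent_files() {
    find "$1" -type f -mtime "-$2" -print
}

# Files under directory $1 owned by user $2 (a name or a numeric ID).
files_owned_by() {
    find "$1" -type f -user "$2" -print
}
```

Something like recent_files /usr 3 or files_owned_by /usr root would then list the matching paths, one per line.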
But what about finding the largest file within a directory, or, say, the largest 5?
Sizing the problem
The first step here is to get size information out of the find command. We could certainly do this by executing an ls -l on each matching file found:-
$ find /usr -type f -name "*.txt" -exec ls -l {} \;
-rw-r--r--. 1 root root 31246 Mar 28 2012 /usr/lib64/xorg/protocol.txt
-rw-r--r--. 1 root root 14233 Jul 20 2010 /usr/lib64/python2.6/site-packages/PyXML-0.8.4-py2.6.egg-info/SOURCES.txt
-rw-r--r--. 1 root root 9 Jul 20 2010 /usr/lib64/python2.6/site-packages/PyXML-0.8.4-py2.6.egg-info/top_level.txt
This works, though it isn’t going to be terribly efficient on a large result set, since it launches a separate ls process for every matching file. Nor is the information returned especially easy for us to process. Using the -ls flag is tidier and more efficient:-
$ find /usr -type f -name "*.txt" -ls
25056 32 -rw-r--r-- 1 root root 31246 Mar 28 2012 /usr/lib64/xorg/protocol.txt
16367 16 -rw-r--r-- 1 root root 14233 Jul 20 2010 /usr/lib64/python2.6/site-packages/PyXML-0.8.4-py2.6.egg-info/SOURCES.txt
16369 4 -rw-r--r-- 1 root root 9 Jul 20 2010 /usr/lib64/python2.6/site-packages/PyXML-0.8.4-py2.6.egg-info/top_level.txt
…
But this still has rather a lot of noise which complicates our processing of the output. We really just want to capture the path and the size – kinda like the output of -print but with the size included.
Enter the -printf option, which gives us access to a lot of the information find knows about a file and quite tight control over how we format it.
$ find /usr -type f -name "*.txt" -printf "%s %p\n"
31246 /usr/lib64/xorg/protocol.txt
14233 /usr/lib64/python2.6/site-packages/PyXML-0.8.4-py2.6.egg-info/SOURCES.txt
9 /usr/lib64/python2.6/site-packages/PyXML-0.8.4-py2.6.egg-info/top_level.txt
…
For each matching file here we’re writing out the size in bytes (%s), followed by a space and then the file’s path (%p), wrapped up with a newline (\n).
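The same mechanism reaches other fields too. As a small sketch, assuming GNU find, here the %f and %h directives (the file's base name and the directory containing it) are printed after the size, separated by tabs. The function name is hypothetical.

```shell
# Hypothetical helper: size, base name and parent directory, tab separated,
# for every *.txt file under directory $1.
# Assumes GNU find, since -printf is a GNU extension.
size_name_dir() {
    find "$1" -type f -name "*.txt" -printf "%s\t%f\t%h\n"
}
```

For the job at hand, though, the plain size and path pair is all we need.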
Armed with just the size and name information formatted in this way, all we need to do is use sort -n to sort the output in ascending numeric order, and finish it off with tail to get the biggest one:-
$ find /usr -type f -name "*.txt" -printf "%s %p\n" | sort -n | tail -1
2639355 /usr/share/hwdata/oui.txt
Or how about the biggest 5:-
$ find /usr -type f -name "*.txt" -printf "%s %p\n" | sort -n | tail -5
763795 /usr/share/perl5/unicore/LineBreak.txt
938846 /usr/share/perl5/unicore/NamesList.txt
1117369 /usr/share/perl5/unicore/UnicodeData.txt
1270011 /usr/share/perl5/Unicode/Collate/allkeys.txt
2639355 /usr/share/hwdata/oui.txt
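If you'd rather see the biggest file first, reversing the sort and taking the head gives the same set in descending order. Here's a minimal sketch, again assuming GNU find. The function name and its arguments are hypothetical.

```shell
# Hypothetical helper: print the N largest files under a directory,
# biggest first. Usage: largest_files DIR [N] [PATTERN]
# Assumes GNU find for -printf.
largest_files() {
    dir=$1
    count=${2:-5}        # default to the top 5
    pattern=${3:-*}      # default to every file name
    find "$dir" -type f -name "$pattern" -printf "%s %p\n" |
        sort -rn |
        head -n "$count"
}
```

Calling largest_files /usr 5 '*.txt' should list the same five files as the pipeline above, just in the opposite order.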
-printf: unknown primary or operator
As a wise man once said, the great thing about standards is there’s so many to choose from. The above works wonderfully on my Red Hat and CentOS boxes, but on my Mac it barfs:-
$ find /usr -type f -name "*.txt" -printf "%s %p\n" | sort -n | tail -5
find: -printf: unknown primary or operator
Okay, so let’s take a look at the output from the -ls option on this box:-
$ find /usr -type f -name "*.txt" -ls
89749713 0 -r--r--r-- 1 root wheel 137 31 Jul 00:05 /usr/share/cups/ipptool/testfile.txt
89749917 0 -r--r--r-- 1 root wheel 95 31 Jul 00:06 /usr/share/doc/cups/robots.txt
89749926 0 -rw-r--r-- 1 root wheel 3511 30 Jul 22:04 /usr/share/doc/groff/1.19.2/examples/mom/README.txt
…
The details we need are certainly in there: the size of the file in bytes is the seventh field and the path is the eleventh. Using the cut command to slice these out is an option, but we’d have to use character indices, which is brittle because the column widths change with the values in each row.
How about squirting it through awk instead? This will greedily consume the variable whitespace between our fields for free:-
$ find /usr -type f -name "*.txt" -ls | awk '{print $7 " " $11}'
137 /usr/share/cups/ipptool/testfile.txt
95 /usr/share/doc/cups/robots.txt
3511 /usr/share/doc/groff/1.19.2/examples/mom/README.txt
…
Perfect! We’ve got exactly the same format which -printf "%s %p\n" was giving us. We just need to tack on our sort -n and tail and we’re good to go:-
$ find /usr -type f -name "*.txt" -ls | awk '{print $7 " " $11}' | sort -n | tail -5
308561 /usr/share/vim/vim74/doc/version5.txt
348909 /usr/share/vim/vim74/doc/eval.txt
362485 /usr/share/vim/vim74/doc/options.txt
577047 /usr/share/vim/vim74/doc/version6.txt
674805 /usr/share/vim/vim74/doc/version7.txt
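To avoid remembering which box needs which variant, one option is to probe for -printf and fall back to -ls with awk when it isn't there. This is only a sketch. The function names are hypothetical, the probe assumes find exits non-zero when it rejects -printf (as the error above suggests), and the fallback relies on the same field positions we just used.

```shell
# Hypothetical helper: "size path" lines for every regular file under $1,
# using -printf where find supports it and -ls plus awk otherwise.
size_and_path() {
    # Probe: run -printf against the starting directory only (-prune stops
    # find descending) and throw the output away; only the exit status matters.
    if find "$1" -prune -printf "%s\n" >/dev/null 2>&1; then
        find "$1" -type f -printf "%s %p\n"
    else
        # Assumes the size is field 7 and the path is field 11 of -ls output.
        find "$1" -type f -ls | awk '{print $7 " " $11}'
    fi
}

# The five largest files under a directory, smallest of the five first.
largest_five() {
    size_and_path "$1" | sort -n | tail -n 5
}
```

With that in place, largest_five /usr ought to behave the same way on both kinds of machine, at least for paths without any awkward characters in them.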
So far I’ve only used these commands to get a list of files to triage manually – if you’re going to use this sort of thing as the basis for automated deletion or archiving I’d check how this works with any “exotic” file types you might have lurking in your directory hierarchy first.
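One case worth checking is a path containing spaces, which the awk step cuts off at the first space because it only prints field 11. A possible workaround, sketched here under a hypothetical function name, is to print field 7 and then everything from field 11 to the end of the line. It is still not bulletproof: runs of spaces collapse to a single space, a newline in a file name will still split the record, and some find implementations escape unusual characters in their -ls output, so the printed path may not match the real one exactly.

```shell
# Hypothetical replacement for the awk step: reads "find ... -ls" output on
# standard input and prints the size (field 7) followed by the whole path
# (field 11 through the last field, rejoined with single spaces).
size_and_full_path() {
    awk '{
        printf "%s", $7
        for (i = 11; i <= NF; i++) printf " %s", $i
        printf "\n"
    }'
}
```

It slots into the pipeline in place of the original awk command, for example find /usr -type f -ls | size_and_full_path | sort -n | tail -5.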