Linux – Finding the largest files in a directory tree

It’s not too hard to combine the sort command with find’s formatting options to locate the largest files in a directory hierarchy. As is so often the case though, one size does not fit all.

From the Linux shell, finding all the files under a directory hierarchy is pretty straightforward:-

$ find /usr -type f -print | more
/usr/lib64/librpcsvc.a
/usr/lib64/libeggdbus-1.so.0.0.0
/usr/lib64/libFestival.so.1.96.0
…

And the find command supports all sorts of helpful options to narrow down our search – for example, to find files modified within the last few days, owned by a particular user, or with a name matching a specified pattern:-

$ find /usr -type f -name "*.txt" -print
/usr/lib64/xorg/protocol.txt
/usr/lib64/python2.6/site-packages/PyXML-0.8.4-py2.6.egg-info/SOURCES.txt
/usr/lib64/python2.6/site-packages/PyXML-0.8.4-py2.6.egg-info/top_level.txt
…
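
The other filters mentioned above slot in the same way. As a minimal sketch (the scratch directory and file names are invented for illustration), -mtime -3 matches files modified within the last three days and -user matches on the owner:

```shell
# Scratch directory with one fresh file and one backdated file (names are illustrative).
demo=$(mktemp -d)
printf 'new\n' > "$demo/recent.txt"
printf 'old\n' > "$demo/stale.txt"
touch -t 200001010000 "$demo/stale.txt"   # set the mtime to 1 Jan 2000

# Modified within the last 3 days: only recent.txt should match.
recent=$(find "$demo" -type f -name "*.txt" -mtime -3 -print)
echo "$recent"

# Owned by the current user (find accepts a numeric user ID as well as a name).
mine=$(find "$demo" -type f -user "$(id -u)" -print | wc -l)
echo "$mine"

rm -rf "$demo"
```

The tests can be combined freely; find ANDs them together by default.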

But what about finding the largest file within a directory, or say the largest 5?

Sizing the problem

The first step here is to get size information out of the find command. We could certainly do this by executing an ls -l on each matching file found:-

$ find /usr -type f -name "*.txt" -exec ls -l {} \;
-rw-r--r--. 1 root root 31246 Mar 28  2012 /usr/lib64/xorg/protocol.txt
-rw-r--r--. 1 root root 14233 Jul 20  2010 /usr/lib64/python2.6/site-packages/PyXML-0.8.4-py2.6.egg-info/SOURCES.txt
-rw-r--r--. 1 root root 9 Jul 20  2010 /usr/lib64/python2.6/site-packages/PyXML-0.8.4-py2.6.egg-info/top_level.txt

This works, though it spawns a separate ls process for every matching file, so it isn’t going to be terribly efficient on a large result set, nor is the information returned especially easy for us to process. Using the -ls flag is tidier and more efficient:-

$ find /usr -type f -name "*.txt" -ls
 25056   32 -rw-r--r--   1 root     root        31246 Mar 28  2012 /usr/lib64/xorg/protocol.txt
 16367   16 -rw-r--r--   1 root     root        14233 Jul 20  2010 /usr/lib64/python2.6/site-packages/PyXML-0.8.4-py2.6.egg-info/SOURCES.txt
 16369    4 -rw-r--r--   1 root     root            9 Jul 20  2010 /usr/lib64/python2.6/site-packages/PyXML-0.8.4-py2.6.egg-info/top_level.txt
…

But this still has rather a lot of noise which complicates our processing of the output. We really just want to capture the path and the size – kinda like the output of -print but with the size included.

Enter the -printf option, which gives us access to a lot of the information find knows about a file and quite tight control over how we format it.

$ find /usr -type f -name "*.txt" -printf "%s %p\\n"
31246 /usr/lib64/xorg/protocol.txt
14233 /usr/lib64/python2.6/site-packages/PyXML-0.8.4-py2.6.egg-info/SOURCES.txt
9 /usr/lib64/python2.6/site-packages/PyXML-0.8.4-py2.6.egg-info/top_level.txt
…

For each matching file here we’re writing out the size in bytes (%s), followed by a space and then the file’s path (%p), wrapped up with a newline (\\n).
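
To see what the directives expand to on a file of known size, here is a small sketch assuming GNU find (the scratch directory and file name are made up for the example). It also shows %f, which prints the file name without its leading directories:

```shell
# Assumes GNU find; the scratch directory and file name are illustrative.
demo=$(mktemp -d)
printf 'hello' > "$demo/greeting.txt"     # exactly 5 bytes

# %s = size in bytes, %p = path as find reached it.
# Single quotes hand \n to find untouched, so no doubled backslash is needed.
by_path=$(find "$demo" -type f -printf '%s %p\n')
echo "$by_path"

# %f = file name only, with the leading directories removed.
by_name=$(find "$demo" -type f -printf '%s %f\n')
echo "$by_name"

rm -rf "$demo"
```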

Armed with just the size and name information formatted in this way, all we need to do is use sort -n to sort the output in ascending numeric order, and finish it off with tail to get the biggest one:-

$ find /usr -type f -name "*.txt" -printf "%s %p\\n" | sort -n | tail -1
2639355 /usr/share/hwdata/oui.txt

Or how about the biggest 5:-

$ find /usr -type f -name "*.txt" -printf "%s %p\\n" | sort -n | tail -5
763795 /usr/share/perl5/unicore/LineBreak.txt
938846 /usr/share/perl5/unicore/NamesList.txt
1117369 /usr/share/perl5/unicore/UnicodeData.txt
1270011 /usr/share/perl5/Unicode/Collate/allkeys.txt
2639355 /usr/share/hwdata/oui.txt
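
If you’d rather have the biggest file at the top, reversing the sort and switching tail for head gives the same selection in descending order. The sketch below feeds sort a handful of the size/path lines from the output above instead of running find, so it should behave the same on any box:

```shell
# Sample "size path" lines copied from the find output above.
sizes='31246 /usr/lib64/xorg/protocol.txt
938846 /usr/share/perl5/unicore/NamesList.txt
9 /usr/lib64/python2.6/site-packages/PyXML-0.8.4-py2.6.egg-info/top_level.txt
2639355 /usr/share/hwdata/oui.txt
1270011 /usr/share/perl5/Unicode/Collate/allkeys.txt'

# -n compares numerically, -r reverses the order, head -n 3 keeps the three largest.
top3=$(printf '%s\n' "$sizes" | sort -rn | head -n 3)
echo "$top3"
```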

-printf: unknown primary or operator

As a wise man once said, the great thing about standards is there’s so many to choose from. The above works wonderfully on my Red Hat and CentOS boxes, but on my Mac it barfs:-

$ find /usr -type f -name "*.txt" -printf "%s %p\\n" | sort -n | tail -5
find: -printf: unknown primary or operator

Okay, so let’s take a look at the output from the -ls option on this box:-

$ find /usr -type f -name "*.txt" -ls
89749713        0 -r--r--r--    1 root             wheel                 137 31 Jul 00:05 /usr/share/cups/ipptool/testfile.txt
89749917        0 -r--r--r--    1 root             wheel                  95 31 Jul 00:06 /usr/share/doc/cups/robots.txt
89749926        0 -rw-r--r--    1 root             wheel                3511 30 Jul 22:04 /usr/share/doc/groff/1.19.2/examples/mom/README.txt
…

The details we need are certainly in there: the size of the file in bytes is the seventh field and the path is the eleventh. Using the cut command to slice these out is an option, but we’d have to use character indices, which may be a little brittle.

How about squirting it through awk instead? This will greedily consume the variable whitespace between our fields for free:-

$ find /usr -type f -name "*.txt" -ls | awk '{print $7 " " $11}'
137 /usr/share/cups/ipptool/testfile.txt
95 /usr/share/doc/cups/robots.txt
3511 /usr/share/doc/groff/1.19.2/examples/mom/README.txt
…
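
To make the difference concrete, here is a sketch run against one line of the -ls output above rather than a live find. It shows that cut’s field mode doesn’t rescue us either, because every single space counts as a delimiter, whereas awk collapses the padding:

```shell
# One line of the macOS `find -ls` output shown above.
line='89749713        0 -r--r--r--    1 root             wheel                 137 31 Jul 00:05 /usr/share/cups/ipptool/testfile.txt'

# cut splits on each individual space, so the runs of padding become
# empty fields and "field 7" comes back empty.
with_cut=$(printf '%s\n' "$line" | cut -d' ' -f7)

# awk splits on runs of whitespace by default, so field 7 is the size
# and field 11 is the path.
with_awk=$(printf '%s\n' "$line" | awk '{print $7 " " $11}')

echo "cut: [$with_cut]"
echo "awk: [$with_awk]"
```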

Perfect! We’ve got exactly the same format which -printf "%s %p\\n" was giving us. We just need to tack on our sort -n and tail and we’re good to go:-

$ find /usr -type f -name "*.txt" -ls | awk '{print $7 " " $11}' | sort -n | tail -5
308561 /usr/share/vim/vim74/doc/version5.txt
348909 /usr/share/vim/vim74/doc/eval.txt
362485 /usr/share/vim/vim74/doc/options.txt
577047 /usr/share/vim/vim74/doc/version6.txt
674805 /usr/share/vim/vim74/doc/version7.txt
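
If you hop between both kinds of box, the two variants could be wrapped in one function. This is only a sketch: largest_files is a name I’ve made up, the probe assumes that a find without -printf exits non-zero when handed it, and the fallback assumes the -ls layout shown above with no spaces in the paths.

```shell
# largest_files DIR PATTERN COUNT  (hypothetical helper, not a standard command)
largest_files() {
    dir=$1 pattern=$2 count=$3
    # Probe for -printf support: GNU find accepts it, BSD/macOS find rejects it.
    if find /dev/null -printf '' >/dev/null 2>&1; then
        find "$dir" -type f -name "$pattern" -printf '%s %p\n'
    else
        # Fall back to -ls; assumes the size is field 7 and the path is field 11.
        find "$dir" -type f -name "$pattern" -ls | awk '{print $7 " " $11}'
    fi | sort -n | tail -n "$count"
}
```

With that defined, largest_files /usr "*.txt" 5 should give the same sort of listing as the pipelines above on either system.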

So far I’ve only used these commands to get a list of files to triage manually – if you’re going to use this sort of thing as the basis for automated deletion or archiving, I’d check how this works with any “exotic” file types you might have lurking in your directory hierarchy first.
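
A path containing spaces is one case worth checking. The sketch below uses a made-up line in the macOS -ls layout (the file name is hypothetical) to show that $11 on its own truncates the path, along with one possible workaround that rejoins the remaining fields. It collapses runs of spaces and won’t survive newlines in names, so treat it as an illustration rather than a fix:

```shell
# Hypothetical line in the macOS `find -ls` layout, with a space in the file name.
line='89750001        8 -rw-r--r--    1 root             wheel                4096 30 Jul 22:04 /usr/share/doc/My Notes.txt'

# Field 11 stops at the first space in the path.
naive=$(printf '%s\n' "$line" | awk '{print $7 " " $11}')

# Rejoin fields 11..NF with single spaces to recover the rest of the path.
joined=$(printf '%s\n' "$line" | awk '{
    path = $11
    for (i = 12; i <= NF; i++) path = path " " $i
    print $7 " " path
}')

echo "$naive"
echo "$joined"
```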
