"Linux Gazette...making Linux just a little more fun!"


DiskHog: Using Perl and the WWW to Track System Disk Usage

By Ivan Griffin, [email protected]


An irksome job that most system administrators have to perform at some stage or other is the implementation of a disk quota policy. As the maintainer of quite a few machines (mostly Linux and Solaris, but also including AIX) without system-enforced quotas, I needed an automatic way of tracking disk usage. To this end, I created a Perl script to regularly check users' disk usage and compile a list of the largest hoggers of disk space. Hopefully, in this way, I can politely intimidate people into reducing the size of their home directories when they get ridiculously large.

The du command summarises disk usage for a given directory hierarchy. When run in each user's home directory, it can report how much disk space that user is occupying. At first, I had written a shell script to run du on a number of user directories, with an awk back-end to provide nice formatting of the output. This proved difficult to maintain when new users were added to the system, and users' home directories were unfortunately located in different places on each operating system.

Perl provided a convenient way of rewriting the shell / awk scripts as a single executable, which not only provided more power and flexibility but also ran faster! Perl's integration of standard Unix system calls and C library functions (such as getpwnam() and getgrnam()) makes it perfectly suited to tasks like this. Rather than provide a tutorial on the Perl language, in this article I will describe how I used Perl to solve my particular problem. The complete source code to the Perl script is shown in listing 1.

The first thing I did was to make a list of the locations in which users' home directories resided, and isolate this into a Perl array. For each sub-directory of the directories listed in this array, a disk usage summary was required. This was implemented by using the Perl system command to spawn off a process running du.

The du output was redirected to a temporary file. The temporary file was named using the common $$ syntax, which is replaced at run time by the PID of the executing process. This guaranteed that multiple invocations of my disk usage script (however unlikely) would not clobber each other's temporary working data.
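
In outline, that step looks something like the following (a sketch using the same variable names as the full script in listing 1):


$TmpFile = "/var/tmp/foo$$";            # $$ expands to the PID of this process

foreach $Directory (@AccountDirs)
{
    chdir $Directory or die "Could not cd to $Directory\n";
    system "du -s * >> $TmpFile";       # one "usage username" pair per line
                                        # (du -k -s on Solaris, for kilobytes)
}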

All the sub-directories were named after the user who owned the account. This assumption made life a bit easier in writing the Perl script, because I could skip users such as root, bin, etc.

I now had, in my temporary file, a listing of disk usage and username, one pair per line. I wanted to split these up into an associative hash of disk usage indexed by username. I also wanted to keep a running total of the entire disk usage, and a count of the number of users. Once Perl had parsed all this information from the temporary file, the file could be deleted.
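
Stripped of the book-keeping, the parsing loop boils down to something like this (a sketch based on the get_disk_usage subroutine in listing 1):


while (<FILEIN>)
{
    chop;                                 # strip the trailing newline
    ($DiskUsage, $Key) = split(' ', $_);  # each line is "usage username"

    $Foo{$Key} = 0 unless defined $Foo{$Key};
    $Foo{$Key} += $DiskUsage;             # usage keyed by username

    $NumUsers++;
    $Total += $DiskUsage;
}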

I decided the Perl script would dump its output as an HTML formatted page. This allowed me great flexibility in presentation, and also permitted the information to be available over the local intranet - quite useful when dealing with multiple heterogeneous environments.

Next I had to work out what information I needed to present. Obviously the date when the script had run was important, and a table of disk usage sorted from largest to smallest was essential. Printing the GCOS information field from the password file allowed me to show both real names and usernames. I also decided it might be nice to provide a hypertext link to the user's homepage, if one existed. Extracting their official home directory from the password file and appending the standard user web directory (typically public_html or WWW) allowed this.
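
The lookup itself is only a few lines of Perl. Roughly, it does what the user_and_homepage subroutine in listing 1 does (the public_html path below is an assumption; the script keeps it in a configurable variable):


($Login, $Passwd, $Uid, $Gid, $Quota, $Comment, $Gcos, $HomeDir, $Shell) = getpwnam($User);

if ( -r "$HomeDir/public_html/index.html" )
{
    print "$Gcos <a href=\"/~$Login\">($User)</a>";   # link to the homepage
}
else
{
    print "$Gcos ($User)";                             # no homepage available
}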

Sorting in Perl usually involves the use of the spaceship operator (<=>). The sort function sorts a list and returns the sorted list. It comes in several forms, but the form used in the code is:


sort sub_name list

where sub_name is a Perl subroutine. sub_name is called during element comparisons, and it must return an integer less than, equal to, or greater than zero, depending on the desired order of the list elements. sub_name may also be replaced with an inline block of Perl code.

Typically sorting numerically ascending takes the form:


@NewList = sort { $a <=> $b } @List;

whereas sorting numerically descending takes the form:


@NewList = sort { $b <=> $a } @List;
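
In the script itself, the list being sorted is the list of usernames, and the comparison is made on the values of the usage hash rather than on the list elements themselves, so the sort looks roughly like this (an inline-block equivalent of the by_disk_usage subroutine in listing 1):


foreach $Key (sort { $Foo{$b} <=> $Foo{$a} } @KeyList)
{
    # emit a table row for $Key, whose usage is $Foo{$Key}
}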

I decided to make the page a bit flashier by adding a few of those omnipresent coloured ball GIFs. Green indicates that the user is within the allowed limits. Orange indicates that the user is in a danger buffer zone - no man's land, from which they are dangerously close to the red zone. A red ball indicates that a user is over quota, and depending on the severity, multiple red balls may be awarded to really greedy, anti-social users.

Finally, I plagued all the web search engines until I found a suitable GIF image of a piglet, which I included at the top of the page.

The only job left was to schedule the script to run nightly as a cron job. It needed to be run as root in order to accurately assess the disk usage of each user - otherwise directory permissions could give false results. To edit root's cron entries (called a crontab), first ensure you have the environment variable VISUAL (or EDITOR) set to your favourite editor. Then type


crontab -e

Add the line from listing 2 to any existing crontab entries. The format of crontab entries is straightforward. The first five fields are integers, specifying the minute (0-59), hour (0-23), day of the month (1-31), month of the year (1-12) and day of the week (0-6, 0=Sunday). An asterisk may be used as a wild-card to match all values; a list of values separated by commas, or a range specified as start-end (separated by a hyphen), is also permitted. The sixth field is the command to be scheduled.
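
For example, a (purely hypothetical) entry such as the following would run /usr/local/bin/some_report at 2:30 am, Monday to Friday:


30 2 * * 1-5 /usr/local/bin/some_report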

A script of this size (with multiple invocations of du) takes some time to run. As a result, it is perfectly suited to scheduling under cron - I have it set to run once a day on most machines (generally during the night, when user activity is low). I believe this script shows the potential of using Perl, cron and the WWW to report system statistics. Another variant of it I have written performs an analysis of web server log files. This script has served me well for many months, and I am confident it will serve other sysadmins too.



#!/usr/local/bin/perl -Tw

# $Id: disk_hog.html,v 1.2 2002/10/09 22:24:18 lg Exp $
#
# Listing 1:
# SCRIPT:       diskHog
# AUTHOR:       Ivan Griffin ([email protected])
# DATE:         14 April 1996
#
# REVISION HISTORY:
#   06 Mar 1996 Original version (written using Bourne shell and Awk)
#   14 Apr 1996 Perl rewrite
#   01 Aug 1996 Found piggie image on the web, added second red ball
#   02 Aug 1996 Added third red ball
#   20 Feb 1997 Moved piggie image :-)

#
# outlaw barewords and set up the paranoid stuff
#
use strict 'subs';
use English;

$ENV{'PATH'} = '/bin:/usr/bin:/usr/ucb'; # ucb for Solaris dudes
$ENV{'IFS'} = '';

#
# some initial values and script defines
#
$NumUsers = 0; 
$Total = 0; 
$Position = 0; 

$RED_ZONE3 = 300000;
$RED_ZONE2 = 200000;
$RED_ZONE = 100000;
$ORANGE_ZONE = 50000;

$CRITICAL = 2500000;
$DANGER   = 2200000;

$TmpFile = "/var/tmp/foo$$";
$HtmlFile = '/home/sysadm/ivan/public_html/diskHog.html';
$PerlWebHome = "diskHog.pl";

$HtmlDir = "WWW";
$HtmlIndexFile = "$HtmlDir/index.html";
$Login = " ";
$HomeDir=" ";
$Gcos = "A user";

@AccountDirs = ( "/home/users", "/home/sysadm" );
@KeyList = (); 
@TmpList = ();

chop ($Machine = `/bin/hostname`);
# chop ($Machine = `/usr/ucb/hostname`); # for Solaris


#
# Explicit sort subroutine
#
sub by_disk_usage
{
    $Foo{$b} <=> $Foo{$a};  # sort integers in numerically descending order
}


#
# get disk usage for each user and total usage
#
sub get_disk_usage 
{
    foreach $Directory (@AccountDirs)
    {
        chdir $Directory or die "Could not cd to $Directory\n";
        # system "du -k -s * >> $TmpFile"; # for Solaris 
        system "du -s * >> $TmpFile";
    }

    open(FILEIN, "<$TmpFile") or die "Could not open $TmpFile\n";

    while (<FILEIN>)
    {
        chop;
        ($DiskUsage, $Key) = split(' ', $_);

        if (defined($Foo{$Key}))
        {
            $Foo{$Key} += $DiskUsage;
        }
        else
        {
            $Foo{$Key} = $DiskUsage;

            @TmpList = (@KeyList, $Key);
            @KeyList = @TmpList;
        };

        $NumUsers ++;
        $Total += $DiskUsage;
    };

    close(FILEIN);
    unlink $TmpFile;
}


#
# for each user with a public_html directory, ensure that it is
# executable (and a directory) and that the index.html file is readable
#
sub user_and_homepage 
{
    $User = $_[0];

    ($Login, $_, $_, $_, $_, $_, $Gcos, $HomeDir, $_) = getpwnam($User)
        or return "$User";

    if ( -r "$HomeDir/$HtmlIndexFile" )
    {
        return "$Gcos <a href=\"/~$Login\">($User)</a>";
    }
    else
    {
        return "$Gcos ($User)</td>";
    };
}

#
# generate HTML code for the disk usage file
#
sub html_preamble
{
    $CurrentDate = localtime;

    open(HTMLOUT, ">$HtmlFile") or die "Could not open $HtmlFile\n";
    printf HTMLOUT <<"EOF";
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 3.0//EN">

<!--
  -- Automatically generated HTML
  -- from $PROGRAM_NAME script
  --
  -- Last run: $CurrentDate
  -->

<html>
<head>
<title>
Disk Hog Top $NumUsers on $Machine
</title>
</head>

<body bgcolor="#e0e0e0">
<h1 align=center>Disk Hog Top $NumUsers on $Machine</h1>

<div align=center>
<table>
<tr>
    <td valign=middle><img src="images/piggie.gif" alt="[PIGGIE!]"></td>
    <td valign=middle><em>This is a <a href=$PerlWebHome>Perl</a>
        script which runs<br>
        automatically every night</em><br></td>
</tr>
</table>

<p>
<b>Last run started</b>: $StartDate<br>
<b>Last run finished</b>: $CurrentDate
</p>

<p>
<table border=2>
<tr>
<th>Status</th>
<td>
EOF

    if ($Total > $CRITICAL) 
    {
        print HTMLOUT "CRITICAL!!! - Reduce Disk Usage NOW!";
    }
    elsif (($Total <= $CRITICAL) && ($Total > $DANGER))
    {
        print HTMLOUT "Danger - Delete unnecessary Files";
    }
    else
    {
        print HTMLOUT "Safe";
    }


    printf HTMLOUT <<"EOF";
</td>
</tr>
</table>
</P>

<hr size=4>

<table border=2 width=70%%>
    <tr>
        <th colspan=2>Chart Posn.</th>
        <th>Username</th>
        <th>Disk Usage</th>
    </tr>

EOF
}

#
#
#
sub html_note_time
{
    $StartDate = localtime;
}



#
# for each user, categorize and display their usage statistics
#
sub dump_user_stats
{
    foreach $Key (sort by_disk_usage @KeyList)
    {
        $Position ++;

        print HTMLOUT <<"EOF";
    <tr>\n
        <td align=center>
EOF

        #
        # colour code disk usage
        #
        if ($Foo{$Key} > $RED_ZONE) 
        {
            if ($Foo{$Key} > $RED_ZONE3)
            {
                print HTMLOUT "        <img src=images/ball.red.gif>\n";
            }

            if ($Foo{$Key} > $RED_ZONE2)
            {
                print HTMLOUT "        <img src=images/ball.red.gif>\n";
            }

            print HTMLOUT "        <img src=images/ball.red.gif></td>\n";
        }
        elsif (($Foo{$Key} <= $RED_ZONE) && ($Foo{$Key} > $ORANGE_ZONE))
        {
            print HTMLOUT "        <img src=images/ball.orange.gif></td>\n";
        }
        else
        {
            print HTMLOUT "        <img src=images/ball.green.gif></td>\n";
        }

        print HTMLOUT <<"EOF";

        <td align=center>$Position</td>
EOF

        print HTMLOUT "        <td align=center>";
        print HTMLOUT &user_and_homepage($Key);
        print HTMLOUT "</td>\n";

        print HTMLOUT <<"EOF";
        <td align=center>$Foo{$Key} KB</td>
    </tr>

EOF
    };
}

#
# end HTML code
#
sub html_postamble
{
    print HTMLOUT <<"EOF";
    <tr>
        <th></th>
        <th align=left colspan=2>Total:</th>
        <th>$Total</th>
    </tr>
</table>

</div>

<hr size=4>
<a href="/">[$Machine Home Page]</a>

</body>
</html>
EOF


    close HTMLOUT ;

#
# ownership hack
#
    $Uid = getpwnam("ivan");
    $Gid = getgrnam("users");

    chown $Uid, $Gid, $HtmlFile;
}


#
# main()
#

&html_note_time;
&get_disk_usage;
&html_preamble;
&dump_user_stats;
&html_postamble;

# all done!
Listing 1. diskHog.pl script source.

0 0 * * * /home/sysadm/ivan/public_html/diskHog.pl

Listing 2. root's crontab entry.

Figure 1. diskHog output.


Copyright © 1997, Ivan Griffin
Published in Issue 18 of the Linux Gazette, June 1997

