
Line Length Distribution in Files

Original post is here eklausmeier.goip.de/blog/2022/11-12-line-length-distribution-in-files.


When processing input files, I have to check whether they share a common record format. To do so, I compute the length of each line (record) in the input file.

1. Perl solution. The program below reads the input file and prints a histogram: each line length together with its frequency.

#!/bin/perl -W
# Histogram of line lengths

use strict;

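my %H;    # line length => number of lines with that length

while (<>) {
    $H{length($_)} += 1;    # length($_) counts the trailing newline
}

# Print the lengths in ascending numeric order with their frequencies
for (sort {$a <=> $b} keys %H) {
    printf("%5d\t%d\n",$_,$H{$_});
}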

2. Perl one-liner. A simple Perl program can often be converted into a Perl one-liner. See, for example, Introduction to Perl one-liners, written by Peteris Krumins. Also see Useful One-Line Scripts for Perl.

perl -ne '$H{length($_)} += 1; END { printf("%5d\t%d\n",$_,$H{$_}) for (sort {$a <=> $b} keys %H); }' yourFile

Example usage:

printf "\n\na\nab\nabc\n" | perl -ne '$H{length($_)} += 1; END { printf("%5d\t%d\n",$_,$H{$_}) for (sort {$a <=> $b} keys %H); }'

gives

    1   2
    2   1
    3   1
    4   1

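3. Awk solution. If Perl is not available, then hopefully Awk is installed. The Awk program below accomplishes pretty much the same. Note that Awk's length($0) does not count the newline, so for newline-terminated lines the reported lengths are one smaller than in the Perl versions.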

#!/bin/awk -f

function max(a,b) {
    return  a>b ? a : b
}

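    # Track the maximum line length and count the lines per length
    { m = max(length($0),m); x[length($0)] += 1 }

END {
    # Print every length from 0 to the maximum; x[i]+0 prints 0 for unused lengths
    for (i=0; i<=m; ++i)
        print i, x[i]+0
}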
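As with Perl, the Awk program can be condensed into a one-liner. The following is a sketch rather than a drop-in replacement: it prints only the lengths that actually occur, and because the order of Awk's for (n in x) loop is unspecified, it pipes the result through sort -n. The function name awk_linelen_hist is hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper: histogram of line lengths using awk and sort.
awk_linelen_hist() {
    # x[length] counts the lines; the loop order is unspecified, hence sort -n
    awk '{ x[length($0)] += 1 } END { for (n in x) print n, x[n] }' "$@" | sort -n
}

printf "\n\na\nab\nabc\n" | awk_linelen_hist
```

For this input the lengths 0, 1, 2, 3 are reported with frequencies 2, 1, 1, 1, separated by a single space instead of the tab used by the Perl printf.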