Unix Command comm: Compare Two Files

Original post is here eklausmeier.goip.de/blog/2018/02-22-unix-command-comm-compare-two-files.


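comm is a lesser-known Unix command, far less known than diff. It compares two already sorted files, FILE1 and FILE2, line by line. By default it prints three columns: lines unique to FILE1, lines unique to FILE2, and lines that appear in both files. The following options suppress individual columns: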

  1. -1 suppress column 1 (lines unique to FILE1)
  2. -2 suppress column 2 (lines unique to FILE2)
  3. -3 suppress column 3 (lines that appear in both files)

The options can be combined. For example, comm -12 F1 F2 suppresses columns 1 and 2 and therefore prints only the lines common to F1 and F2.
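
The option combinations can be tried out with a small sketch like the following. The file names F1 and F2 and their contents are made up for illustration, and LC_ALL=C is set on the assumption that sort and comm should both use plain byte order.

```shell
# Illustrative sketch: F1, F2 and their contents are assumptions.
export LC_ALL=C                 # same collation for sort and comm
tmp=$(mktemp -d)                # scratch directory for the sample files

# comm expects sorted input, so sort while creating the files.
printf '%s\n' cherry apple banana | sort > "$tmp/F1"
printf '%s\n' banana date cherry  | sort > "$tmp/F2"

common=$(comm -12 "$tmp/F1" "$tmp/F2")    # lines in both files
only_f1=$(comm -23 "$tmp/F1" "$tmp/F2")   # lines only in F1
only_f2=$(comm -13 "$tmp/F1" "$tmp/F2")   # lines only in F2

echo "common:";  echo "$common"
echo "only F1:"; echo "$only_f1"
echo "only F2:"; echo "$only_f2"
```

Here -23 and -13 are the counterparts of -12: each suppresses two of the three columns and leaves the remaining one.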

I thought that comm had a bug, so I wrote a short Perl script to simulate the behaviour of comm. Of course, there was no bug. I had simply failed to notice that the records in the two files did not match because of whitespace differences.
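
This kind of whitespace mismatch can be reproduced with a minimal sketch. The sample data is invented, and stripping trailing whitespace with sed is just one possible remedy, not necessarily what was done for the original files.

```shell
# Illustrative sketch: file names and contents are assumptions.
export LC_ALL=C
tmp=$(mktemp -d)

printf 'apple\nbanana\n'  > "$tmp/F1"
printf 'apple \nbanana\n' > "$tmp/F2"    # trailing blank after "apple"

# "apple" and "apple " differ, so only banana is reported as common.
before=$(comm -12 "$tmp/F1" "$tmp/F2")
echo "before: $before"

# sed -n l marks each line end with $, which exposes the trailing blank.
sed -n l "$tmp/F2"

# Strip trailing whitespace, re-sort, and compare again.
sed 's/[[:space:]]*$//' "$tmp/F2" | sort > "$tmp/F2.clean"
after=$(comm -12 "$tmp/F1" "$tmp/F2.clean")
echo "after:"; echo "$after"
```

After cleaning, both apple and banana show up as common lines.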

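#!/bin/perl -W
# mycomm: print the records whose file-membership bitmask matches
use strict;

use Getopt::Std;
my %opts = ('d' => 0, 's' => 0);
# -d: debug output, -s N: bitmask of the files a record must be in
getopts('ds:',\%opts);
my $debug = ($opts{'d'} != 0);
my $member = defined($opts{'s'}) ? $opts{'s'} : 0;

# $set is the bit of the current file (1, 2, 4, ...)
my ($set,$prev) = (1,"");
# %H maps each record to the bitmask of the files containing it
my %H;

while (<>) {
        $prev = $ARGV if ($prev eq "");
        # a new file name means: move on to the next bit
        if ($ARGV ne $prev) {
                $set *= 2;
                $prev = $ARGV;
        }
        chomp;
        $H{$_} |= $set;
        printf("\t>>\t%s: %s -> %d\n",$ARGV,$_,$H{$_}) if ($debug);
}

# default without -s: all bits set, i.e. records present in every file
$member = 2*$set - 1 if ($member == 0);
printf("\t>>\tmember = %d\n",$member) if ($debug);
for my $i (sort keys %H) {
        printf("%s\n",$i) if ($H{$i} == $member);
}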

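The Perl script above does not need sorted input files, because it stores all records of all files in memory, in a hash. The value stored for each record is a bitmask that acts as a set of files: bit value 1 stands for the first file, 2 for the second, 4 for the third, and so on. For example, mycomm -s2 F1 F2 prints only those records that are in file F2 but not in F1. Without -s, the script prints the records that occur in every input file.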
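
The meaning of the -s values can be illustrated with a small shell sketch that computes the same kind of bitmask for a single record. The function membership_mask is a hypothetical helper written for this illustration, not part of the script above, and the sample files are made up.

```shell
# Hypothetical helper: membership_mask LINE FILE...
# Prints a bitmask in which bit i is set if FILE i contains LINE exactly.
# The body runs in a subshell, so its variables stay local.
membership_mask() (
    line=$1; shift
    mask=0
    bit=1
    for f in "$@"; do
        # -F fixed string, -x whole-line match, -q no output
        if grep -Fxq -e "$line" "$f"; then
            mask=$(( mask | bit ))
        fi
        bit=$(( bit * 2 ))
    done
    echo "$mask"
)

tmp=$(mktemp -d)
printf 'apple\nbanana\n'  > "$tmp/F1"
printf 'banana\ncherry\n' > "$tmp/F2"

membership_mask apple  "$tmp/F1" "$tmp/F2"   # only in F1
membership_mask cherry "$tmp/F1" "$tmp/F2"   # only in F2
membership_mask banana "$tmp/F1" "$tmp/F2"   # in both files
```

With two files, a mask of 1 corresponds to -s1 (only in F1), 2 to -s2 (only in F2), and 3 to the default (in both). Unlike the Perl script, this sketch rescans the files for every record, so it is meant only to show the arithmetic.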