, 7 min read
Rewriting Perl to plain C
Original post is here eklausmeier.goip.de/blog/2018/01-30-rewriting-perl-to-plain-c.
Perl script was running too slow. Rewriting it in C made it 20 times faster.
1. Problem statement. Analyze call-tree dependency in COBOL programs. There are 77 million lines of COBOL code in ca. 30,000 files. These 30,000 COBOL programs could potentially include 74,000 COPY-books comprising 10 million lines of additional code. COBOL COPY-books are analogous to C header-files. So in total there are around 88 million lines of COBOL code. Just for comparison: the Linux kernel has ca. 20 million lines of code.
COBOL program analysis started with a simple Perl script. This Perl script is less than 200 lines, including comments. This script produced the desired dependency information.
Wading through all this COBOL code took up to 25 minutes in serial mode, and 13 minutes using 4 cores on an HP EliteBook notebook using Intel Skylake i7-6600U clocked 2.8 GHz. It took 36 minutes on an AMD FX-8120 clocked with 3.1 GHz. This execution time was deemed too long to see any changes in the output changing something in the Perl script. All runs are on Arch Linux 4.14.11-1 SMP PREEMPT.
2. Result. Rewriting the Perl script in C resulted in a speed improvement of factor 20 when run in serial mode, i.e., run times are now 110s on one core. It runs in 32s when using 8 cores on an AMD FX-8120. C program uses taylormade hashing routines.
3. Runtime. How much time is spent in file processing alone? Word-counting through all COBOL code and COPY-books using wc
took around 40s and 7s for source and COPY-books respectively.
$ /bin/time -p wc * | tail -1 (in directory with COBOL source)
77249523 228229442 3656045826 total
real 40.94
user 39.87
sys 1.02
$ /bin/time -p wc * | tail -1 (in directory with COPY-books)
10403819 39061598 575793581 total
real 7.08
user 6.32
sys 0.54
Above timings on AMD FX-8120 clearly show that all the source code is already completely cached in memory as sys=0.54. Above commands were run multiple times. Timings using zsh-internal time
command always show sys=0, and are probably wrong. These timings give a lower bound on expected runtime of any serial C program. It is interesting to compare wc
performance with grep
performance on the same data:
$ /bin/time -p grep -c DIVISION * | tail -1
ZZZ99:4
real 2.46
user 1.63
sys 0.82
There is a large gap, a factor of 10, when grep
needs to do some heavy duty processing on the input:
$ /bin/time -p grep -P "\s+CALL\s+'?([\-\w]+)" * | tail -1
ZZZ99: CALL abc USING uvw
real 22.83
user 22.08
sys 0.74
Perl script used 36 minutes on AMD FX-8120.
$ time siriusDynUpro * > ../resultPerl
real 2185.76s
user 2178.23s
sys 0
I.e., 2185s=36m. Perl is version v5.26.1 built for x86_64-linux-thread-multi.
C program used less than 2 minutes and is therefore 20-times faster than the Perl script.
$ /bin/time -p siriusDynCall * > ../resultC
real 110.24
user 106.19
sys 4.02
4. Hashing in C. While Perl has hashing already built-in, in C one has to either program this oneself, or find a suitable library, e.g., GLib. I wrote on hashing in Hashing Just Random Numbers and Hash functions: An empirical comparison — article by Peter Kankowski. This time I needed lookup, insert, and iterating over all entries. I wrote on this in Hashing with Iterator in C. Also see Hardening Perl's Hash Function by Yves Orton.
5. Scanning COBOL. There are a number of dialects of COBOL. In our case we had classical COBOL with
- Sequence number area: columns 1 to 6
- Indicator area: column 7
- Area A and B: columns 8 to 72
- Space or numbering: columns above 72
In our case the COBOL source code might be of mixed case, which must be interpreted as meaning the exact same thing. The next C function cobolGets()
, similar to fgets()
, fills two arrays: One array containing the original source code, and another where all characters are upper case. The latter is used for our analysis.
// Read one line of COBOL Code: ignore first 6 chars, ignore everything over column 72
// Do not return '\n', rtrim line
// Return number of 'useful' chars in buf[]
static int cobolGets (char buf[], char bufUp[], int maxbuf, FILE *fp) {
int c, i, spc;
char *p, *pUp, *q, *qUp; // we always have: p <= q, pUp <= qUp
newline:
p = q = buf;
pUp = qUp = bufUp;
buf[0] = '\0';
bufUp[0] = '\0';
spc = 1; // line contains only space so far
for (i=0; i<maxbuf && (c = fgetc(fp)) != EOF; ++i) {
if (c == '\n') {
if (i < 6 || spc) goto newline; // line too short or spaces only, therefore read next line
else {
*p = '\0';
*pUp = '\0';
return p - buf; // dropped first 6 chars
}
} else if (i < 6 || i > 71) continue; // first six chars in COBOL line are usually digits or space
else {
*q++ = c;
*qUp++ = toupper(c);
if (c != ' ') {
spc = 0; // at least one non-space
p = q; // move p to q
pUp = qUp;
}
}
}
return (-1); // no more 'useful' chars
}
The following Perl code is roughly analogous, but not the same:
chomp;
$_ = substr($_,0,72) if (length($_) >= 72); # delete numbering on the right
s/\s+$//g; # rtrim
next if (length($_) <= 6);
my $c6 = substr($_,6,1);
next if ($c6 eq '*' || $c6 eq '/'); # COBOL comment
$_ = substr($_,7); # delete numbering on the left + delete c6
my $line = uc $_;
Getting individual words or tokens is done with strtok()
. Function strtok()
alters the original string. This is another reason to have the input in two arrays, see discussion above. The C code does not use PCRE, while the Perl script uses regular expressions in multiple ways, at least in 8 places. COBOL words are extracted by
// Split line into tokens/words, return number of tokens found.
static int cobolToken (char line[], char *token[], int maxtoken) {
int n = 1;
if ((token[0] = strtok(line," ().*/'\"")) == NULL) return 0;
while ((token[n] = strtok(NULL," ().*/'\"")) != NULL)
if (++n >= maxtoken) break;
return n;
}
This is roughly analogous to the Perl \s+
idiom.
Excerpt from readCobol()
:
static int readCobol (int dirFlag, const char *fname) {
...
enum { MAXBUF = 1048576, MAXTOKEN=MAXBUF/2 };
static char buf[MAXBUF], bufUp[MAXBUF], *token[MAXTOKEN];
char *pBuf=buf, *pBufUp=bufUp;
struct hashElem *e;
FILE *fp;
...
if ((fp = fopen(fname,"rb")) == NULL) ...
while ((n = cobolGets(pBuf,pBufUp,MAXBUF,fp)) >= 0) {
if (pBuf[0] == '*' || pBuf[0] == '/') continue; // ignore COBOL comment line
pBufUp[0] = ' '; // cope for debugging line: 'D' in column 6
if (strcmp(pBufUp+n-6," EJECT") == 0 // ignore compiler-directing statements
|| strcmp(pBufUp+n-6," TITLE") == 0
|| strcmp(pBufUp+n-6," SKIP1") == 0
|| strcmp(pBufUp+n-6," SKIP2") == 0
|| strcmp(pBufUp+n-6," SKIP3") == 0) continue;
if ((n = cobolToken(bufUp,token,MAXTOKEN)) <= 1) continue; // line not relevant for our purpose
...
if (strcmp(token[1],"DIVISION") == 0) {
if (strcmp(token[0],"DATA") == 0) division = 1;
else if (strcmp(token[0],"PROCEDURE") == 0) division = 2;
else division = 3;
} else if (strcmp(token[0],"COPY") == 0) {
readCobol(strlen(token[1]) > 5,token[1]); // recursive call
} else if (division == 1) { // DATA DIVISION
... work von VALUE clause ...
} else if (division == 2) { // PROCEDURE DIVISION
// check for: MOVE abc TO uvw, or MOVE abc(k) TO uvw
// abc might itself result from spreading, so look in dynHash
p = NULL;
if (n >= 4 && strcmp(token[0],"MOVE") == 0
&& ( (product = hashLookup(pgmHash,token[1])) != NULL
|| (p = hashLookup(dynHash,token[1])) != NULL )
) {
...
} else if (n >= 2 && strcmp(token[0],"CALL") == 0) {
hashInsert(callHash,token[1],"1",1); // token[1] has actually been called
...
}
}
}
if (fclose(fp)) return 3;
return 0;
}
It was at first unexpected that there are real COBOL programs where the VALUE clause spans multiple thousand lines! That is the reason why the buffers buf[]
and bufUp[]
(uppercase of buf[]
) for a single line/statement are so exceptionally large, here 1 MB.
6. Profiling. To detect which C function consumes most of the CPU time I used gprof
. This requires to compile with flags -pg -no-pie
. Flag -pg
alone does not work due to gcc Bug 21189 - gprof doesn't work with code built with PIE
cc -Wall -g -pg -no-pie siriusDynCall.c -o siriusDynCall
Result using gprof siriusDynCall gmon.out
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
82.18 54.76 54.76 167506298 0.00 0.00 cobolGets
11.61 62.49 7.74 24809 0.31 2.67 readCobol
2.78 64.35 1.85 50861689 0.00 0.00 hashKR
1.74 65.51 1.16 89742068 0.00 0.00 cobolToken
1.19 66.30 0.79 50480957 0.00 0.00 hashLookup
0.54 66.66 0.36 74427 0.00 0.00 hashClear
0.05 66.69 0.04 2 17.52 21.76 readZZTab
0.00 66.69 0.00 380732 0.00 0.00 hashInsert
0.00 66.69 0.00 380732 0.00 0.00 sHashInsert
0.00 66.69 0.00 7 0.00 0.00 hashInit
0.00 66.69 0.00 7 0.00 0.00 sHashInit
Clearly I/O dominates, as expected from wc
timings above. Functions sHash*()
are just wrappers around hash*()
.
I also checked with Valgrind whether I messed up with storage allocations.
7. Dead end. Using strace
for wc
and grep
I noticed that read()
was reading in larger chunks than above C program, e.g.,
read(3, " * PROGRAMM: xyz "..., 32768) = 32768
Furthermore I noticed some hitherto unknown fadvise64()
statements, like
stat("xyz", {st_mode=S_IFREG|0644, st_size=216375, ...}) = 0
openat(AT_FDCWD, "uvw", O_RDONLY) = 3
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
statements for wc
. I tried
if (posix_fadvise(fileno(fp),0,0,POSIX_FADV_SEQUENTIAL)) ...
setbuffer(fp,fbuf[0],MAXFBUF);
directly after fopen()
. But this did not change runtime in any form.
8. Outlook. There is still some room for performance improvement. One could use gperftools memory allocator, instead of malloc/calloc/free
from libc
. Although, one can see from the profiling results above that memory allocation is clearly not dominating runtime, rather I/O is.