Every once in a while I get to write a neat piece of code that I can share. This is one of those times. I realize it is not large and by PerlMonk standards not very elegant. The problem therein lies with maintainability over the next few years. Regardless I like what I wrote and would like to share.
At the Circus we had a pretty good idea that we had some data leakage. Nothing like people taking off with everything needed to get home loans and rip off customers, just people not thinking about what they send through email. We didn’t know the extent of the problem or even if we had one. We just weren’t sure. Our C-level executives didn’t believe that employees would be so careless with customer data. We decided to find out.
I must say that the results were actually quite positive. We had a couple of people email work related data home so they could work at home over the weekend and a few emails regarding employment, but they were originated by the prospective employee.
Regardless, in order for us to find out I wrote a few scripts that hook into our email system. One that I am particularly proud of recurses through a directory of email messages and attachments scanning each file for relevant data.
Please note that by the time these scripts touch the data it has been scrubbed by the antivirus and other checks we have in place. I am only looking for keywords or regular expressions that would indicate customer related data loss.
Let me explain the directory structure. Under the email system is the directory /var/spool/filter that contains every email message that has been sent in the last 30 minutes. There is a cleanup process that erases all the files in that directory and that is actually where I wrote the hook, in the cleanup process. Here is a sample listing of the directory.
#ls -1 /var/spool/filter/ msg-1299451572-29517-0 msg-1299451626-29523-0 msg-1299451695-29528-0 msg-1299452467-29565-0 msg-1299452491-29570-0 msg-1299453007-29593-0 msg-1299453086-29599-0
As you can see, each email header ends with a .hed extension and the message is in .txt format. The ETP.doc file is an attachment.
#ls -1 /var/spool/filter/msg-1299451626-29523-0/ ETP.doc msg-29523-1.txt msg-29523-2.dat.hed
The subroutine I am most pleased with is the one that recurses through the directory structure. The slurp command returns a hash and if it is a subdirectory then it is a hash as well. I look for it with the following line of code.
if (ref $structure->{$key} eq 'HASH')
That is how I find subdirectories to push onto the stack of recursive calls. As it traverses each directory it just looks at each file extension and makes a determination as to what to do with it.
I realize most system administrators are asking why I didn’t use the file command to make sure the script was acting appropriately for each file type but that does not work with the new Microsoft document types.
# file Test-Excel.xlsx Test-Excel.xlsx: Zip archive data, at least v2.0 to extract
I thought it was a fun project and I enjoyed writing what I felt was an interesting piece of code.
#!/usr/bin/perl # 2011-01-12 Jud Bishop # This script goes looking for customer data being sent out through email and # flags it for further review. use strict; use warnings; use File::Find; use File::Basename; use File::Copy::Recursive qw(fcopy dircopy rcopy); use File::Slurp::Tree;</code> #my $dir = "/home/jud/TestMessages"; #my $log = "/home/jud/TestMessages/violation"; #my $auditdir = "/home/jud/TestMessages/Trash/"; my $dir = "/var/spool/filter"; my $log = "/var/log/hipaa/violation"; my $auditdir = "/opt/smtpaudit/"; my $debug = 0; ################### # MAIN ################### my %tree; my $tree = slurp_tree($dir); open (LOG, '>>', $log) or die $!; traverse_structure($dir, $tree); close LOG or die $!; ########## # This does the heavy lifting of the whole program. It recursively # iterates through the directory structure and works on a file accordingly. # Each directory is a hash key. ########## sub traverse_structure { if($debug){print "##traverse_structure\n";} my ($base, $structure) = @_; my $path; my @violation; my $secure; foreach my $key ( keys %$structure) { $path = $base . "/" . $key; $secure = 0; ## If it's a HASH then it's a directory. if (ref $structure->{$key} eq 'HASH'){ if($debug){print "key: $key\n"}; traverse_structure( $path, $structure->{$key} ); } else { if($debug){print "file : $key\n"}; if($debug){print "base : $base\n"}; if($debug){print "path : $path\n"}; if($debug){print "secure: $secure\n"}; if($debug){print "violation: $#violation\n"}; ## If the file is not being used... if ($path =~ m/doc$/){ parse_doc($path, \@violation); } elsif ($path =~ m/xlsx$|xls$/) { parse_excel($path, \@violation); } elsif ($path =~ m/txt$/) { parse_message($path, \@violation); } elsif ($path =~ m/pdf$/) { parse_pdf($path, \@violation); } elsif ($path =~ m/hed$/) { parse_head($path, \@violation, \$secure); } } } # If it is a secure email than it is encrypted on the fly and not a violation. if ( ($secure == 0) && ($#violation > 3) ){ push (@violation, "EMAIL: " . $base); log_it(@violation); copy_dir($base); } } # For later review. sub copy_dir { my $path = shift; if($debug){print "##copy_dir $path\n";} my $file = fileparse($path); if ($file =~ m/^msg/){ my $basename = basename($path); my $newpath = $auditdir . $basename; if($debug){print "dircopy $path $newpath\n";} dircopy($path,$newpath); } } # Log file that is easy to ready because an employee goes through # this file and decides if it is a REAL violation. sub log_it { my @text = @_; my $line; if($debug){print "##log_it\n";} print LOG "---------------------------------------------\n"; foreach $line (@text) { print LOG "$line\n"; } print LOG "---------------------------------------------\n"; } sub parse_head { my ($file, $violation_ref, $secure_ref) = @_; my @body; my $line; if($debug){print "##parse_head $file\n";} open(FILE,$file) || return 0; @body = ; close(FILE); foreach $line (@body) { if ($line =~ m/^From/){ push (@$violation_ref, $line); } elsif ($line =~ m/^To/) { push (@$violation_ref, $line); } elsif ($line =~ m/^Subject/) { push (@$violation_ref, $line); if ($line =~ m/^secure/i ) { $$secure_ref = 1; } } } } sub parse_pdf { my ($file, $violation_ref) = @_; my @body; my $new_file = $file . ".txt"; my $CMD; if($debug){print "##parse_doc $dir $file\n";} $CMD = "/usr/bin/pdftotext \"" . $file . "\" > \"" . $new_file . "\""; if($debug){print "CMD: $CMD\n";} system($CMD); parse_text ($new_file, $violation_ref); } sub parse_doc { my ($file, $violation_ref) = @_; my @body; my $new_file = $file . ".txt"; my $CMD; if($debug){print "##parse_doc $dir $file\n";} $CMD = "/usr/bin/antiword -st \"" . $file . "\" > \"" . $new_file . "\""; if($debug){print "CMD: $CMD\n";} system($CMD); parse_text ($new_file, $violation_ref); } sub parse_excel { my ($file, $violation_ref) = @_; my @body; my $new_file = $file . ".txt"; my $CMD; if($debug){print "##parse_excel $file\n";} $CMD = "/usr/local/bin/antiexcel \"" . $file . "\" > \"" . $new_file . "\""; if($debug){print "CMD: $CMD\n";} system($CMD); parse_text ($new_file, $violation_ref); } sub parse_text { my ($file, $violation_ref) = @_; my @body; if($debug){print "##parse_text $file\n";} open(FILE,$file) || return 0; @body = ; close(FILE); compare_text(\@body, $violation_ref); } sub parse_message { my ($file, $violation_ref) = @_; my @body; if($debug){print "##parse_text $file\n";} open(FILE,$file) || return 0; @body = ; close(FILE); compare_text(\@body, $violation_ref); } # All of the earlier subroutines call this one. # It takes the text and looks for keywords. sub compare_text { my ($text_ref, $violation_ref) = @_; my @difference; my @text_array; my @elements; my %count; my %rules; my $element; if($debug){print "##compare_text\n";} foreach $element (@$text_ref){ @elements = split(' ', $element); push (@text_array, @elements); } # The parser was already created above. my @rule = ("DOB", "D.O.B.", "d.o.b.", "dob", "death:", "release", "admit", "admission", "Age:", "SSN", "Social", "Security", "Account", "Acct", "claimant", "MRI", "myelogram", "credit", "card"); # Me being lazy. foreach $element (@rule) { $rules{$element} = 1; } foreach $element (@text_array) { if (exists $rules{$element}) { if($debug){print "$element\n";} $element = "VIOLATION: " . $element; push (@$violation_ref, $element); } # Social Security Number elsif($element =~ /\d{3}-?\d{2}-?\d{4}/) { if($debug){print "$element\n";} $element = "VIOLATION: " . $element; push (@$violation_ref, $element); } # Credit Card Number or MRN elsif($element =~ /\d{4}-?\d{4}-?\d{4}-?\d{4}/) { if($debug){print "$element\n";} $element = "VIOLATION: " . $element; push (@$violation_ref, $element); } } }