Tuesday, April 21, 2009

Parsing HTML in Perl

use HTML::TreeBuilder;

# Load the file into a tree
$html_tree = HTML::TreeBuilder->new;
$html_tree->parse_file($file_name);

# Get all of the meta tags
@meta_tags = $html_tree->find('meta');

The code uses the HTML::Tree Perl module from CPAN to take the HTML file, referenced in the $file_name variable, and build a tree of the tags in memory. Once the tree is built, I can use the find method to find all of the meta tags and put them into the array @meta_tags. Once I have the array, I can step through its elements one at a time and process them as required.

It is worth noting the HTML::Tree module is dependent on the HTML::Parser module which is dependent on the HTML::Tagset module.

The following link contains code to extract Excel data:
http://www.tc.umn.edu/~hause011/code/extract_from_many_excel.html

EXTRACTING DATA FROM TABLE using Perl

Perl has a module that does this: HTML::TableExtract (http://kobesearch.cpan.org:/htdocs/HTML-TableExtract/HTML/TableExtract.html). The examples are a good start; don't forget to add "my" to each variable declaration (for example, "my $te") when using "use strict".

Also, to download the HTML data directly you could use:

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
my $html = get("http://ubuntuforums.org");
my $table = HTML::TableExtract->new;
$table->parse($html);
# Table parsed, extract the data.

To filter a script's output through an external command, open a filehandle on a pipe:

open OUTPUT, "| grep 'foo' > result.txt" or die "Failure: $!";

We can then write whatever we want to the "OUTPUT" filehandle. The Unix "grep"
command will filter out any text which doesn't contain the text "foo"; any text
which DOES contain "foo" will be written to "result.txt".


Don't Use OPEN
And for programs that only return a few lines of output, use back-ticks:

my $username = `whoami` or die "Couldn't execute command: $!";
chomp($username);
print "My name is $username\n";

You can also use the "qx//" operator, which is exactly equivalent to back-ticks
except that it allows you to choose your delimiter:

# All of the following are exactly equivalent
# to the above command that uses back-ticks.
my $username = qx/whoami/ or die "Couldn't execute command: $!";
my $username = qx(whoami) or die "Couldn't execute command: $!";
my $username = qx#whoami# or die "Couldn't execute command: $!";
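
As a minimal sketch, the exit status of the command can also be checked through $?, which both back-ticks and qx// set. The run_command helper and the echo command below are illustrative assumptions, not part of the examples above.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command with qx// and return its output,
# dying if the command could not be run or exited with a non-zero status.
sub run_command {
    my ($command) = @_;
    my $output = qx($command);
    die "Couldn't execute '$command': $!" unless defined $output;
    die "'$command' exited with status " . ($? >> 8) if $? != 0;
    chomp($output);
    return $output;
}

my $greeting = run_command('echo hello');
print "The command printed: $greeting\n";
```

Checking $? as well as the returned text helps distinguish a command that printed nothing from one that failed.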

COMMAND LINE ARGUMENTS

Command-Line Arguments :: @ARGV

* So how do we get command line arguments? Perl automatically stores any command line arguments in a global array called @ARGV.
* The elements of @ARGV hold the command-line arguments in the order they were passed to the program. You can parse these by hand, or use the more "sophisticated" getopt method below.
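
Parsing @ARGV by hand might look something like the following sketch. The describe_args helper is a hypothetical name introduced here for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: describe each argument in the list it is given.
sub describe_args {
    my @args = @_;
    my @lines;
    for my $i (0 .. $#args) {
        push @lines, "Argument $i is '$args[$i]'";
    }
    return @lines;
}

# @ARGV holds the arguments passed on the command line, in order.
print "Got ", scalar(@ARGV), " argument(s)\n";
print "$_\n" for describe_args(@ARGV);
```

Saved as, say, args.pl, this could be run as "perl args.pl input.txt 42" to list both arguments.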

Command-Line Arguments :: Getopt::Std and Getopt::Long

* Perl provides getopt-like functionality using the packages Getopt::Std and Getopt::Long. For example:

use Getopt::Std;

getopt('oDI');            # -o, -D & -I take arg. Sets $opt_* as a side effect.
getopt('oDI', \%opts);    # -o, -D & -I take arg. Values in %opts
getopts('oif:');          # -o & -i are boolean flags, -f takes an argument.
                          # Sets $opt_* as a side effect.
getopts('oif:', \%opts);  # options as above. Values in %opts


This example is taken from the man pages. getopt works in much the same way as getopt in C, but is simpler. In the last example, getopts accepts -o and -i as boolean flags and -f with an argument. The hash %opts holds the results, with the switch letters as the keys: $opts{f} contains the argument given to -f, and $opts{o} is true if -o was present.
* Getopt::Long can take in switches longer than one character.
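
The following sketch runs both modules against simulated command lines so the results can be inspected. The option names (--file, --verbose) and the sample arguments are assumptions made for illustration.

```perl
use strict;
use warnings;
use Getopt::Std;
use Getopt::Long qw(GetOptionsFromArray);

# Simulated command line for the short-option example.
local @ARGV = ('-o', '-f', 'data.txt', 'leftover');
my %opts;
getopts('oif:', \%opts) or die "Bad short options\n";
print "o=$opts{o} f=$opts{f} remaining=@ARGV\n";

# Long options parsed from an explicit array instead of @ARGV.
my @long_args = ('--file', 'data.txt', '--verbose');
my ($file, $verbose) = ('', 0);
GetOptionsFromArray(\@long_args, 'file=s' => \$file, 'verbose' => \$verbose)
    or die "Bad long options\n";
print "file=$file verbose=$verbose\n";
```

Both modules remove the switches they recognise from the argument list, so whatever remains afterwards (here, "leftover") can be treated as ordinary arguments.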

Tangent :: Using Packages/Modules

* You've probably noticed by now that we can call libraries, packages, or modules (whatever you want to call them), using the use keyword followed by the module.
* In the above example, we used the Getopt::Std package. There are tons of standard modules and even more non-standard modules. Think of these as ANSI C Standard libraries and non-standard libraries for Perl. Other packages include File::Copy, Text::ParseWords, Math::Complex, and CGI. There are many, many more. See Schwartz Pgs. 238-242, and Wall Ch. 30-32 (pgs. 831-915).

File Type Test

File Input/Output :: File Tests

* To make file error handling more robust, you can apply file tests to filenames or filehandles. For example, to test whether a file exists before opening it:

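$filename = "hooray.index";
if (-e $filename) {
    # The file already exists: append to it
    open(OUTF, ">>hooray.index") || die "cannot open: $!";
} else {
    # Otherwise create (and append to) a differently named file
    open(OUTF, ">>hooray-hooray.index") || die "cannot open: $!";
}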

* There are many other file tests: -T tests for a text file, -B tests for a binary file, -d tests for a directory, -l tests for a symlink.
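
As a rough sketch, several of these tests can be combined to classify a path. The classify_path helper and the temporary file names are hypothetical, introduced only for this example.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: classify a path using Perl's file test operators.
# -l is checked first because the other tests follow symlinks.
sub classify_path {
    my ($path) = @_;
    return 'symlink'   if -l $path;
    return 'missing'   unless -e $path;
    return 'directory' if -d $path;
    return 'text file' if -T $path;
    return 'binary file';
}

# Build a scratch directory containing one small text file.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/notes.txt";
open(my $out, '>', $file) or die "cannot create $file: $!";
print $out "plain text\n";
close($out);

print "$_: ", classify_path($_), "\n" for ($dir, $file, "$dir/absent");
```

Note that -T and -B are heuristic guesses based on the file's contents, so they are best treated as hints.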

Perl also provides a way to query information about a particular file:

($dev, $ino, $mode, $nlink, $uid, $gid, $rdev, $size, $atime, $mtime,
$ctime, $blksize, $blocks) = stat ($filename);


The stat function provides all these fields about the particular file $filename. Refer to Wall Pgs. 800-801 for more information about stat. You can also get information about a file using the File::stat package:

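use File::stat;
$sb = stat($filename) or die "cannot stat $filename: $!";
# A method call is not interpolated inside a string, so print it separately
print $sb->size, "\n";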


This would print the size of the file. More on File::stat in the Larry Wall book.
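
Here is a self-contained sketch of the same idea that creates its own temporary file, so it can be run anywhere. The file contents are an arbitrary assumption.

```perl
use strict;
use warnings;
use File::stat;
use File::Temp qw(tempfile);

# Create a small temporary file so there is something to inspect.
my ($fh, $filename) = tempfile(UNLINK => 1);
print $fh "12345";
close($fh);

# stat() from File::stat returns an object, or undef on failure.
my $sb = stat($filename) or die "cannot stat $filename: $!";
my $size = $sb->size;
printf "%s is %d bytes, last modified %s\n",
    $filename, $size, scalar localtime($sb->mtime);
```

The object's other methods (mode, uid, mtime and so on) correspond to the fields returned by the built-in stat.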

File Error Handling in Perl

File Input/Output :: Error Handling
What if we can't get a filehandle? We should handle this case gracefully by using die. For example:

open (inf, "/some/directory/file.in") || die "cannot open: $!";

So what does this do? Perl tries to open file.in OR it calls die with the string. The $! contains the most recent system error, so it will append a useful tag to the output of die. You could even make a dienice subroutine that could be more helpful. You can exit a Perl program immediately by calling exit;.
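
One possible shape for such a dienice subroutine is sketched below. Both helper bodies and the wording of the message are assumptions, not a standard Perl facility.

```perl
use strict;
use warnings;

# Hypothetical "dienice" helper: add context before dying.
sub dienice {
    my ($message) = @_;
    die "Fatal error: $message\n";
}

# Hypothetical helper: open a file for reading, or die with a clear message.
sub open_for_reading {
    my ($path) = @_;
    open(my $fh, '<', $path) or dienice("cannot open $path: $!");
    return $fh;
}

# eval traps the die so this demonstration can report the error and carry on.
my $fh = eval { open_for_reading('/some/directory/file.in') };
print(defined($fh) ? "Opened the file\n" : "Caught: $@");
```

In a real script you would normally let the die end the program; the eval is only there to show the message.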

File Handling In perl Script

Handling input and output from/to files and the Command Prompt

As previously noted, programs consist primarily of data and instructions for manipulating the data. Earlier we looked at programs where the data was included in the program code. However, programs can also manipulate data from external sources, such as text files and the terminal.

Input/Output from/to the Terminal

There are two principal ways of getting data into and out of Perl scripts: by using the terminal and by using files. The "terminal" is basically the command prompt supplied by the operating system of your computer. Any output to the terminal is known to Perl (and to most Unix-based programming languages) as STDOUT, and any input from the terminal is known as STDIN. The most common use of STDOUT is with the print function; in fact, if you do not tell "print" where to print to, it outputs to STDOUT by default. Therefore, the following two statements are equivalent:

print "Hello";
print STDOUT "Hello";

In the scripts that we will be writing in this course, we won't be using STDIN, which is mainly useful for short amounts of unstructured data (our data tends to be highly structured, using tabs, fields, etc.). However, since it is so easy to print to STDOUT, many people writing short scripts print to STDOUT and redirect the output to a file. This allows them to forgo opening and writing files from within Perl (which we will look at in a moment). To redirect STDOUT to a file, use the following syntax:

perl script.pl > captured_output.txt

The > operator is not part of Perl; rather, it is part of most operating systems including Unix, Linux, Mac OS X and Windows.

Redirecting STDOUT to a file is not always desirable. For example, if you want your script to output more than one file, redirecting is not straightforward. Also, redirecting only works well when the output is plain text. MARC communications files are binary, so we should not use redirection to create them.

Input/Output from/to Files

Perl uses the following functions to open and close files (appropriately called
"open" and "close"):

open (INPUTFILE, "$input");
close (INPUTFILE);

"INPUTFILE" is called a filehandle, and is the name that Perl uses to refer to the open file. The actual name has no significance (we could have called this one CGGG6HHH), although by convention it is written in upper-case characters. The second parameter, "$input" in this case, is a variable that contains the location of the file that is to be associated with the filehandle.

It is common to include error-checking code in file open and close statements, since files can have any number of problems, such as not being there (file not found) or permissions problems. The open statement above with this type of error checking is:

open (INPUTFILE, "$input") or die ("Problem with opening $input: $!");

$! is a special Perl variable that contains the most recent system error message, which in this case explains why the file could not be opened. The die function prints the message and ends the script.

You define how you want an open file to interact with your script by assigning a "mode". The three most common modes are read, overwrite, and append, signified by <, >, and >>, respectively. The mode indicators are prepended to the location of the file. For example,

open (FILE, "<$file"); # Means open $file in read mode
open (FILE, ">$file"); # Means open $file in overwrite mode
open (FILE, ">>$file"); # Means open $file in append mode
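
The same three modes can also be written with the three-argument form of open and a lexical filehandle, which keeps the mode separate from the file name. The sketch below exercises each mode in turn; the temporary directory and file name are assumptions made so the example is self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/example.txt";

# Overwrite mode: creates the file, or empties it if it already exists.
open(my $out, '>', $file) or die "Problem with opening $file: $!";
print $out "first line\n";
close($out);

# Append mode: adds to the end of the existing contents.
open(my $append, '>>', $file) or die "Problem with opening $file: $!";
print $append "second line\n";
close($append);

# Read mode: read every line back into an array.
open(my $in, '<', $file) or die "Problem with opening $file: $!";
my @lines = <$in>;
close($in);

print "Read ", scalar(@lines), " lines\n";
```

Because the mode is a separate argument, a file name that happens to begin with > or < cannot change how the file is opened.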


Now, to put the pieces together. If you want to read data from a file, simply open it in read mode. The contents of the file can then be read through the filehandle and manipulated in various ways. We'll describe how to read a file a line at a time later.

If you want to add data to a file (called "printing" the data to the file), you need to open the file in either overwrite or append mode (depending on what you want to do) and then use the print function along with the filehandle you want to print to, like this:

open (OUTPUT, ">$output") or die ("Problem with opening $output: $!");
print OUTPUT "Whatever you want to add to the file";
close (OUTPUT);


Notice that in this example we printed to OUTPUT just like we printed to STDOUT above. STDOUT is actually a special filehandle reserved for the terminal.

It is not usually necessary to explicitly close open files, but it is a good habit to get into.

Exercise (Spreadsheet)

To illustrate how the above works in practice, we will write a spreadsheet program.

First create the spreadsheet in jEdit and save it as spreadsheet.txt:

12 4 167 17 8
9 34 4 12 1
62 14 67 0 88
78 9 34 67 5


That's five numbers on each line, separated by tabs. They don't have to be these numbers exactly.

Now let's write a new program, spreadsheet.pl, that will add up all the numbers on each line and give us separate totals.

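# This is the world's simplest spreadsheet program

open (SPREADSHEET, "spreadsheet.txt")
    or die ("Problem opening file: $!");

# Read the file one line at a time; each line arrives in $_
while (<SPREADSHEET>) {

    $numbers = $_;
    chomp($numbers);

    # Split the line into numbers on the tab character
    @numbers = split(/\t/, $numbers);

    $total = 0;

    # Add up every number on this line
    foreach $number (@numbers) {
        $total = $total + $number;
    }

    print $total . "\n";

}

close (SPREADSHEET);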


How it Works

As you can see, this program re-uses a lot of the same logic that we covered with hello.pl, and yet what it does is very different.

while means 'while whatever is between the brackets returns a value, do whatever is between the curly braces'. Between the brackets goes the filehandle in angle brackets, <SPREADSHEET>, which reads the next line from the file. The standard way that Perl reads files is one line at a time, so the instructions between the braces get repeated for every line in the file.

As lines from the file are read into the program, Perl doesn't really know what to call them. So it assigns them the variable name $_, for lack of anything more helpful. You could continue to refer to them that way, but it makes the rest of the program more readable to give the incoming lines a more descriptive variable name. So they are assigned here to the $numbers variable, using the assignment operator, =.

Other things we have not yet seen are the addition operator, +, and the split instruction. The split instruction splits data into chunks on whatever you specify as the separator, and stores the resulting values in an array. In this case, we've told split to split the line using the tab character as the separator, represented by \t.

After we've used the split instruction to populate the @numbers array, we then use foreach to run through each number in the array, adding it to the $total variable. This is an example of a nested loop (the foreach loop is nested within the while loop, which means the entire foreach loop is run for every iteration of the while loop). There is no limit as to how deeply you can nest loops: theoretically you can have loops within loops within loops ad infinitum. However, each extra level multiplies the number of iterations, so deeply nested loops over large amounts of data can really slow down your script.
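
The split-and-total logic can also be tried without a file, as in the following sketch. The row_totals helper is a hypothetical name, and the two sample lines are the first two rows of the spreadsheet above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the total of each tab-separated line.
sub row_totals {
    my @lines = @_;
    my @totals;
    foreach my $line (@lines) {                    # outer loop: once per line
        chomp($line);
        my $total = 0;
        foreach my $number (split(/\t/, $line)) {  # inner loop: once per number
            $total += $number;
        }
        push @totals, $total;
    }
    return @totals;
}

my @sheet = ("12\t4\t167\t17\t8\n", "9\t34\t4\t12\t1\n");
print "$_\n" for row_totals(@sheet);   # prints 208, then 60
```

Putting the loops inside a subroutine makes it easy to check the totals before pointing the script at a real file.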