tv_grab_be.in - OpenGrok cross reference for /dports/textproc/p5-xmltv/xmltv-0.5.70/grab/be/tv_grab_be.in

#!/usr/local/bin/perl -w
=pod

=head1 NAME

tv_grab_be - Grab TV listings for Belgium

=head1 SYNOPSIS

tv_grab_be --help

tv_grab_be [--config-file FILE] --configure [--slow] [--gui OPTION]

tv_grab_be [--config-file FILE] [--output FILE] [--days N]
           [--offset N] [--quiet] [--slow] [--gui OPTION]

tv_grab_be [--output FILE] [--quiet] [--config-file FILE] --list-channels

tv_grab_be --capabilities

tv_grab_be --version

=head1 DESCRIPTION

Output TV and radio listings in XMLTV format for many stations
available in Belgium.  The data comes from the Sanoma magazines'
websites: Tele Moustique and Teve Blad.

=head1 USAGE

First you must run B<tv_grab_be --configure> to choose the language,
grab mode and which stations you want to receive.

Then running B<tv_grab_be> with no arguments will get about 6
dayE<39>s of summary only listings for the channels you chose.

If you want to grab detailed information (such as episode name,
detailed descriptions, actors) then use the B<--slow> flag when both
onfiguring and running the grabber. The configure mode will prompt
you for selection criteria for when the grabber should retrieve
detailed information for programmes (selected by start time, category,
and channel). This makes grabbing slow (hence the option name!)

Note that different stations ar available in French and Dutch modes
due to the listings differences from the two sites. The data is also
different -- the French site has more detailed info for the french
language channels, and the Dutch site has more info for the Dutch
language channels.

Some channels (BBC World, Euronews) although listed, have such bad
listings data that you would be better off using the tv_grab_uk_rt
grabber, and then merging the resulting files with tv_cat!

It is perhaps worthwhile to use the --config-file option to maintain
one config file with a selection of channels for each language, then
using tv_cat to merge the resulting XML files.

B<--configure> Prompt for language, grab mode and which
stations to download and write the configuration file (see also --slow)

B<--config-file FILE> Set the name of the configuration file, the
default is B<~/.xmltv/tv_grab_be.conf>.  This is the file written by
B<--configure> and read when grabbing.

B<--output FILE> When grabbing, write output to FILE rather than
standard output.

B<--days N> When grabbing, grab N days rather than as many as
possible.

B<--offset N> Start grabbing at today + N.  N may be negative.

B<--quiet> Suppress the progress messages normally written to standard
error.

B<--slow> Slow mode: get detailed information for specified
programmes. With B<--configure>, this enables the configuration
routine to prompt for the criteria which programs have to match for
detailed information downloading. Otherwise, this enables the grabbing
of detailed for programmes matching the defined criteria.

B<--trace> Show debug information (if L<Log::TraceMessages> is installed)

B<--gui OPTION> Use this option to enable a graphical interface to be used.
OPTION may be 'Tk', or left blank for the best available choice.
Additional allowed values of OPTION are 'Term' for normal terminal output
(default) and 'TermNoProgressBar' to disable the use of Term::ProgressBar.

B<--list-channels> Dump channel information for all channels but no
programmes.  This grabber needs a config file first before the
channels can be dumped.

B<--capabilities> Show which capabilities the grabber supports. For more
information, see L<http://wiki.xmltv.org/index.php/XmltvCapabilities>

B<--version> Show the version of the grabber.

B<--help> Print a help message and exit.

=head1 WARNING

In B<--slow> mode, unning this grabber requires very many web page
fetches (one per channel per day, and then one per programme selected
for detailed information) from a very slow web site.

The number of web page fetches can be limited by limiting the number
of programs to get detailed information for (by start time range,
category or channel). This is defined when run with B<--configure
--slow> or in the config file.

=head1 SEE ALSO

L<xmltv(5)>, L<http://www.telepocket.be>, L<http://www.teveblad.be>

=head1 AUTHOR

Niel Markwick, nielm@bigfoot.com
Based on B<tv_grab_uk_rt>

=head1 BUGS

The website parsing isnE<39>t perfect and there may be warning
messages about bits of HTML that arenE<39>t understood.  Some of the
details provided by the site have to be thrown away because they
cannot be accommodated in the XMLTV format; again, warning messages
are printed.

Programmes containing defined sections are not handled very well (such
as Sportpaleis on Canvas) because the data source lists the sections
separately with overlapping timeslots. eg:

=over

=item "13.30-14.00 Hands Up!"

=item "13.30-17.30 Sportpaleis"

=item "14.00 14.30 Champions League Magazine"

=item "14.30 17.00 Wielrennen: Kuurne - Brussel - Kuurne"

=item "17.00 17.15 Autorennen: F1"

=item "17.15 17.30 Daar is 'm!"

=back


The data on the website can also be poor. Program names gain and lose
random punctuation from week to week. eg:

=over

=item "Buffy, the Vampire Slayer.",

=item "Buffy the Vampire Slayer",

=item "Buffy, the Vampire Slayer",

=item "Buffy the Vampire Slayer."

=back

The grabber strips trailing punctuation to help avoid this.

Sometimes the stop time is not put on last programme of the day. This
can be worked around by piping the output through tv_sort, and the start
time of the first program of the next day will be used.
Alternatively, the following complex tv_grep command can add an
implicit stop time of 06:00:

tv_grep -e 'if (not ${$_}{stop}) { (${$_}{stop} = ${$_}{start}) =~ s/\d{6}\b/060000/ }; 1'

Finally there are several things still to do (see TODO list in source
code for full description).

=head1 HISTORY

B<2008-10-14 nielm> - remove teveblad due to user-agent blocking

B<2007-01-09 nielm> - switch back to telemoustique for fr

B<2004-01-08 nielm> - first version with selective detail grabbing,
based on tv_grab_uk_rt 0.5.27

B<2004-01-09 nielm> - Disable detail grabbing by default; implemented
--slow option to enable detail grabbing and to complicate
configuration procedure; . Removed implicit generation of stop
time. Correct windows special characters (128-159): oe ligatures ->
oe; fancy quotes -> normal quotes, others -> ? (with warning); Fixed
warning about Log::Tracemessages::On.

B<2004-01-13 nielm> - Replace '...' Windows character; Added parsing
of star ratings; warn about unrecognised images in description text;
Added parsing of movie ratings (classifications); Future-proof config
file to cope with grabbing multiple languages simultaneously; Added
--output option

B<2004-01-15 nielm> - Put year into date tag; Put director of films
(if found in descr) into director tag; Get year from descr if not
found, add channel logos (from satlogo.com).

B<2004-01-26 nielm> - handle VO/OV image without warning; make
multi-line descriptions; correct date parsing; correct episode num
in FR listings; correct stop time bug when no stop time defined;
removed lang=xx from title and sub-title;

B<2004-01-29 nielm> - add icons in rating and star-rating; handle
episode numbers in titles better; handle extracting of director
better.

B<2004-02-23 nielm/epaepa> - tidy up of help text, remove newlines
from desc, improve start/stop time details matching, add detaul URL to
fast mode programme info

B<2004-03-04 nielm> - Correct usage, handle Duree (length), handle
repeats (previously-shown), handle episode numbers in description.
More things added to TODO list (see source code)

B<2004-03-09 nielm> - Remove Duree and (R.) from description
text. Remove categories in description. Do not put episode numbers in
sub-title

B<2004-04-01 nielm> - Fixed bug with no stop time for programs
starting at midnight, handle 'New' icon, Fixed JIMTV channel ID,
handle 'Divers' tags (which may contain info on previously-shown),
remove 'gastacteurs:' from actor names.

B<2004-04-05 epaepa/nielm> - Make time zones consistantly +0100 or
+0200, but never mixed. Clean up punctuation around actor
names. Remove duplicate ChannelID from file (only put alternative
ID's). better actor parsing

B<2004-04-05 epaepa/nielm> - Handle user input better during config
(CTRL-D, CTRL-H), use substrings not regexps for category/channel
matching to avoid nastyness when a user enters a bad regexp. Add magic
category *NONE* and *ALL* for category matching.

B<2004-04-15 nielm> - yet more cleanup in parsing actor names, ignore
programmes with no titles.

B<2005-03-12 nielm> - no function changes: just changes to the
configuration section to use XMLTV::ask, and updates to the
channel_ids files

B<2005-09-22 nielm> - www.telemoustique.be no longer works: quick
and dirty fix: use www.telepocket.be instead.

=cut

# TODO
#
# Merge fr and nl configs to allow single config file.
#
#
# Handle listings where a program is shown again later in the day
# without a separate lsting for it: eg:
#    <desc lang="fr">les moments les plus delirants de
#    l'emission. (13.30, 16.15, 20.45, 0.30)</desc>
#    <desc lang="fr">S�rie anim�e. La d�cision de Petit-Coeur
#    (R. � 17.30 et 24.00)</desc>
#    <desc lang="fr">S�rie australienne (R. � 18.00)</desc>
# implies this program will be shown at these later times...
#
#
# Handle Followed by 'Suivi' in Fremch descriptions
#   <desc lang="fr">(R.) Suivi de Le Shopping.</desc>
#   <desc lang="fr">Suivi, � 14.40, de Tranche de rire.</desc>
#   <desc lang="fr">suivi � 14.10 de La Boutique - 14.40 Tranche de rire.</desc>
#   <desc lang="fr">S�rie quotidienne fran�aise suivie, � 19.55, de la M�t�o.</desc>
#   <desc lang="fr">suivi de L'invit� - La m�t�o</desc>
#   <desc lang="fr">suivi de L'invit� - La m�t�o. Pascal Vrebos re�oit Laurette Onkelinx, Ministre de la Justice (PS).</desc>
#
# Handle grouped programmes
#     <desc lang="fr">Le lutin Plop
#      - 7.00 La cour de r�cr� (R.)
#      - 7.25 Pepper Ann (R.)
#      - 7.50 Jim Bouton.</desc>
#


use strict;
use XMLTV::Version '$Id: tv_grab_be.in,v 1.17 2010/09/02 05:07:40 rmeden Exp $ ';
use XMLTV::Capabilities qw/baseline manualconfig cache/;
use XMLTV::Description 'Belgium';
use XMLTV::Supplement qw/GetSupplement/;

use IO::Socket;
use LWP::Simple;
use Date::Manip;
use Getopt::Long;
use HTML::Entities;
use XMLTV;
use XMLTV::Memoize;
use XMLTV::Ask;
use XMLTV::ProgressBar;
use XMLTV::DST;
use XMLTV::Config_file;
use XMLTV::Get_nice;
use XMLTV::Date qw(parse_date);
use XMLTV::Usage <<END
To configure:     $0 --configure [--config-file FILE] [--slow] [--gui OPTION]
To grab listings: $0 [--config-file FILE] [--output FILE]
                     [--days N] [--offset N] [--quiet] [--slow] [--gui OPTION]
To list channels: $0 [--output FILE] [--quiet] [--config-file FILE] --list-channels
To show capabilities: $0 --capabilities
To show version: $0 --version
END
  ;

# Use Log::TraceMessages if installed.
BEGIN {
    eval { require Log::TraceMessages };
    if ($@) {
        *t = sub {};
        *d = sub { '' };
    }    else {
        *t = \&Log::TraceMessages::t;
        *d = \&Log::TraceMessages::d;
        Log::TraceMessages::check_argv();
    }
}

sub get_url( $ );
sub get_programmes( $$$$$$ );
sub get_programme_summary( $$$$ );
sub get_programme_detailed_info($$);
sub parse_programme_details($$); # ref of prog hash, array of descr strings
sub get_channels();
sub get_categories();
sub get_available_dates();
sub be_to_xmltv( $ );
sub xmltv_to_be( $ );
sub grab( $$ );
sub configure();

# GLOBAL CONSTANTS
my $LANG_FR = 'fr';
my $LANG_NL = 'nl';

# language-dependant constants
my %DOMAIN = ( $LANG_FR => 'telemoustique.be',
		 $LANG_NL => 'teveblad.be' );
my %BASE_URL = ( $LANG_FR => "http://www.$DOMAIN{$LANG_FR}/tm/",
		 $LANG_NL => "http://www.$DOMAIN{$LANG_NL}/ndl/");

# channel to use for getting dates
my %DATE_CH   = ( $LANG_FR => 'LA%20UNE',
		  $LANG_NL => 'TV1' );

my %SUMMARY_PATH = ( $LANG_FR => "programme_tele_chaine.html",
                    $LANG_NL => "zender.asp" );


my %DETAIL_PATH = (  $LANG_FR => "programme_tele_detail.html?progid=" ,
		     $LANG_NL => "detail.asp?progid=" );

# populated from config file
my @detailgenre;
my $detailstarttime;
my $detailstoptime;
my %get_channel_detail;
my $LANG;

#stats
my $numwebgets=0;
my $kbwebgets=0;
my $statstarttime=time();

# Check options.  First do the undocumented --cache option (to cache
# get(), which retrieves web pages), then the normal ones.
#
my $using_cache
     = XMLTV::Memoize::check_argv('XMLTV::Get_nice::get_nice_aux');
my ($opt_days,
    $opt_help,
    $opt_output,
    $opt_input,
    $opt_slow,
    $opt_gui,
    $opt_configure,
    $opt_config_file,
    $opt_offset,
    $opt_quiet,
    $opt_list_channels,
   );
# No default for $opt_days, we determine it from the site.
$opt_offset = 0; # default today
$opt_quiet  = 0; # default
GetOptions('days=i'        => \$opt_days,
           'help'          => \$opt_help,
           'configure'     => \$opt_configure,
           'slow'          => \$opt_slow,
           'gui:s'         => \$opt_gui,
           'config-file=s' => \$opt_config_file,
           'output=s'      => \$opt_output,
           'offset=i'      => \$opt_offset,
	   'list-channels' => \$opt_list_channels,
           'quiet'         => \$opt_quiet,
           'input=s'       => \$opt_input, # undocumented -- debug mode:
	                                   # read data from html file,
	                                   # no web page gets apart
	                                   # from program details
          )
  or usage(0);
die 'number of days must not be negative'
  if (defined $opt_days && $opt_days < 0);
if ($opt_help) {
    usage(1);
}

# Date::Manip has a bug where 'now' will be wrong if you change the
# timezone.  It won't be correctly converted from the system timezone
# to the new one.  So we call parse_date('today midnight') _before_
# Date_Init().
#
my $today = DateCalc(parse_date('today midnight'), "$opt_offset days");
Date_Init('TZ=UTC');

XMLTV::Ask::init($opt_gui);

# Tables to convert between telemoustique / teveblad and XMLTV ids of channels.
# The way to access these is through the routines be_to_xmltv() and
# xmltv_to_be(), not directly.  Those will deal sensibly with a new
# channel that isn't mentioned in the file.
#
my (%be_to_xmltv, %xmltv_to_be, %extra_dn, %ch_warn, %logourl);
my $line_num = 0;

foreach my $CURLANG ( $LANG_FR, $LANG_NL ) {
    my $CHANNEL_NAMES_FILE = "channel_ids_$CURLANG";
    my $str = GetSupplement( 'tv_grab_be', $CHANNEL_NAMES_FILE );

    $line_num=0;
    foreach (split( /\n/, $str)) {
        ++ $line_num;
        tr/\r//d;
        s/#.*//;
        next if m/^\s*$/;
        my $where = "$CHANNEL_NAMES_FILE:$line_num";
        my @fields = split(/:/,$_,5);
        die "$where: wrong number of fields: " . (scalar @fields)
            if @fields < 4 or @fields > 5;
        my ($xmltv_id, $be_id, $extra_dn, $logourl, $ch_warn) = @fields;
	die "$where Sanonma id not specified"
	    if ( not defined $be_id || $be_id eq '' );
        warn "$where: $CURLANG Sanoma id $be_id seen already\n"
            if defined $be_to_xmltv{$CURLANG}{$be_id};
        $be_to_xmltv{$CURLANG}{$be_id} = $xmltv_id;
        warn "$where: $CURLANG XMLTV id $xmltv_id seen already\n"
            if defined $xmltv_to_be{$CURLANG}{$xmltv_id};
        $xmltv_to_be{$CURLANG}{$xmltv_id} = $be_id;
        $extra_dn{$CURLANG}{$xmltv_id} = $extra_dn
	    if ( defined $extra_dn && $extra_dn ne '' );
        $logourl{$CURLANG}{$xmltv_id} = $logourl
	    if ( defined $logourl && $logourl ne '' );
        $ch_warn{$CURLANG}{$xmltv_id} = $ch_warn
	    if ( defined $ch_warn && $ch_warn ne '' );
    }
}
t 'xmltv_to_be: ' . d \%xmltv_to_be;
t 'be_to_xmltv: ' . d \%be_to_xmltv;
t 'extra_dn: ' . d \%extra_dn;
t 'ch_warn: ' . d \%ch_warn;

# Arguments for XMLTV::Writer.
my %g_args = ();
if (defined $opt_output) {
    die "cannot have both --output and --configure\n" if $opt_configure;
    my $fh = new IO::File ">$opt_output";
    die "cannot write to $opt_output\n" if not $fh;
    %g_args = (OUTPUT => $fh);
}

# Find the configuration file.  This grabber needs it even for listing
# channels since the channels available depend on the language.
#
my $config_file
  = XMLTV::Config_file::filename($opt_config_file, 'tv_grab_be', $opt_quiet);

if ($opt_configure) {
    configure();
    exit;
}

# Not configuring - need to read an existing config file.
my @config_lines = XMLTV::Config_file::read_lines($config_file);

# Read the configuration file for language option
# language <F|D>
foreach (@config_lines) {
    ++ $line_num;
    next if not defined;
    my $where = "$config_file:$line_num";
    if (/^language\s+(.+)/) {
	if ( $1 eq $LANG_FR || $1 eq $LANG_NL) {
	    $LANG=$1;
	}
	else {
	    die "$where: invalid language defined in conf file\n";
	}
    }
}
die "language not defined in $config_file" if (not defined $LANG );

if ( $LANG eq $LANG_NL ) {
    die "Dutch language grabbing is no longer supported because Teveblad blocks XMLTV\n";
}

# Stuff at the top of any output XML.
my $metadata = { 'source-info-url'     => "$BASE_URL{$LANG}",
		 'source-info-name'    => "$DOMAIN{$LANG}",
		 'generator-info-name' => 'XMLTV',
		 'generator-info-url'  =>
		 'http://xmltv.org/',
	       };

if ($opt_list_channels) {
    # Could check usage here to see --days etc. were not specified but
    # I can't be bothered.
    #
    my %channels = get_channels; # uses $LANG
    my $writer = new XMLTV::Writer(%g_args, encoding => 'ISO-8859-1');
    $writer->start($metadata);
    $writer->write_channels(\%channels);
    $writer->end;
}
else {
    if ($opt_input) {
	warn "grabbing from source HTML file: $opt_input for dummy channel ID la1.rtbf.be\n";
    }
    grab(\%g_args, \@config_lines);
}

printf (STDERR "Accessed %d web pages, downloaded %d Kb, duration %d secs\n",$numwebgets,$kbwebgets,time()-$statstarttime) unless $opt_quiet;
exit();

# Grab listings and write them in XML.  Parameters:
#
# ref to hash of arguments to be passed to XMLTV::Writer (but encoding
#   is always ISO-8859-1),
# ref to list of lines from config file.
#
sub grab( $$ ) {
    my ($w_args, $config_lines) = @_;
    my $writer = new XMLTV::Writer(%$w_args, encoding => 'ISO-8859-1');
    my %write_channels; # to be written as <channel> elements

    # FIXME turn into progress bar.
    print STDERR "finding channels:\t" unless $opt_quiet;

    my %channels;
    if ($opt_input) {
	# skip getting channels
        my @dns = ([ "La Une", $LANG ], [ "La Une"] );
        my $ch = { 'display-name' => \@dns,
                   'id' => "la1.rtbf.be"};
        t 'channel object: ' . d $ch;
        $channels{"la1.rtbf.be"} = $ch;
    }
    else {
	%channels = get_channels();
    }
    print STDERR "got " . (scalar keys %channels) . ", done.\n" unless $opt_quiet;

    # Read the configuration file.  At present the lines must be one
    # of the forms:
    #
    # channel <xmltv id> <fr:nl> [dodetail]
    # language <fr|nl>
    # detailgenere <regex>
    # detailstartime <hh:mm>
    # detailstoptime  <hh:mm>
    # ALL
    #
    my $line_num = 1;
    foreach (@$config_lines) {
        ++ $line_num;
        next if not defined;
        my $where = "$config_file:$line_num";
        if (/^channel\s+([^\s]+)\s+($LANG_FR|$LANG_NL)\s*([^\s]*)/) {
            my $xmltv_id = $1;
	    # $2 is grab language -- for future use when grabber
	    # can simultaneously grab both languages
	    die "$where: Specification of different Grab language currently not implemented"
		unless ( $2 eq $LANG);

            if (not defined $channels{$xmltv_id}) {
                warn "$where: no channel with XMLTV id $xmltv_id, skipping\n";
                next;
            }
            $write_channels{$xmltv_id} = $channels{$xmltv_id};
	    if ( $3 eq "dodetail" ) {
		$get_channel_detail{$xmltv_id} = 1;
	    }
        }
        elsif (/^language\s+(.+)/) {
            # already read -- ignore
        }
        elsif (/^detailstarttime\s+([0-9]{2}:[0-9]{2})$/) {
            $detailstarttime=$1;
        }
        elsif (/^detailstoptime\s+([0-9]{2}:[0-9]{2})$/) {
	    $detailstoptime=$1;
	}
        elsif (/^detailgenre\s+(.+)/) {
	    # allow obsolete detailgeneres for ^$ and .*
	    if ( $1 eq ".*" ) {
		warn "obsolete detailgenre \"$1\"in config file -- replace with *ALL*";
		push @detailgenre, "*ALL*";
	    } elsif ( $1 eq "^\$" ) {
		warn "obsolete detailgenre \"$1\" in config file -- replace with *NONE*";
		push @detailgenre, "*NONE*";
	    } else {
		push @detailgenre, $1;
	    }
        }
        else { die "$where: bad line: \"$_\"\n" }
    }

    t "channels to get detail for: " . d \%get_channel_detail;
    t "detailstarttime=$detailstarttime" if ( $detailstarttime );
    t "detailstoptime=$detailstoptime" if ( $detailstoptime );
    t "genre(s) to get detail for: " . d \@detailgenre;

    if ( $opt_slow )
    {
 die <<END
You must reconfigure with --configure --slow to choose the programmes
to get details for.
END
   if not defined $detailstarttime or not defined $detailstoptime;

	# sanity check detail start time
	if ( ( $detailstoptime ge "24:00" )
	     || ( $detailstarttime ge "24:00" ) ) {
	    die "Invalid detail start/stop time range in $config_file: $detailstarttime - $detailstoptime";
	}
    }
    else
    {
	if ( $detailstoptime
	     || $detailstoptime
	     || @detailgenre
	     || %get_channel_detail ) {
	    say ( <<END
WARNING: Config file contains settings for downloading detailed
programme information, but --slow has not been specified on command
line

No detailed programme information will be downloaded
END
		  ) ;
	}
    }

    # FIXME turn this into progress bar.
    print STDERR "getting dates for which listings available:\t"
      unless $opt_quiet;
    my @available_dates;
    if ($opt_input) {
	# skip getting dates
	@available_dates = ( $today );
    }
    else {
	@available_dates = get_available_dates();
    }
    t 'available dates: ' . d \@available_dates;
    die 'apparently, there are no days of listings on the site'
      if not @available_dates;
    print STDERR "got " . @available_dates .  ", done.\n" unless $opt_quiet;

    my $is_available = sub( $ ) {
        my $d = shift;
        foreach (@available_dates) {
            return 1 if not Date_Cmp($d, $_);
        }
        return 0;
    };

    my @dates_to_get;
    for (my $d = $today; $is_available->($d); $d = DateCalc($d, '+ 1 day')) {
        push @dates_to_get, $d;
    }
    die "listings for today ($today) not available" if not @dates_to_get;
    my $last_day = $dates_to_get[-1];
    foreach (@available_dates) {
        if (Date_Cmp($last_day, $_) < 0) {
            warn "strangely, day $_ is available but there are gaps before it";
        }
    }

    if (defined $opt_days) {
        if ($opt_days > @dates_to_get) {
            warn 'only ' . (scalar @dates_to_get)
              . ' days of consecutive listings available';
        }
        else {
            @dates_to_get = @dates_to_get[0 .. $opt_days - 1];
        }
    }
    my $days = @dates_to_get > 1 ? 'days' : 'day';
    say('getting ' . (scalar @dates_to_get) . " $days of listings\n")
	unless $opt_quiet;
    say("(\"#\" indicates a program with summary info,\n" .
        " \"@\" indicates a program with detailed info)\n")
	unless ($opt_quiet || ! $opt_slow);
    t 'getting dates:' . d \@dates_to_get;

    $writer->start($metadata);

    # get the listings for each date
    my %categories;
    my %prog_to_cat;
    my @programmes;
    foreach my $date (@dates_to_get) {
        my @new_programmes;

        foreach my $chan (sort keys %write_channels) {
            #
            my $dn = $write_channels{$chan}->{'display-name'};
            my $name = XMLTV::best_name([ $LANG ], $dn)->[0];
            $name = $chan if not defined $name;

            # FIXME turn into progress bar.
            print STDERR 'date ', UnixDate($date, '%Y%m%d'), ", channel $name:\t"
                unless $opt_quiet;
            push @new_programmes,
            get_programmes($chan, $date, \%prog_to_cat,
                           \%categories, \%channels, $opt_input);
            print STDERR "\n" unless $opt_quiet;
        }

	# push the new channels into the completlist
	push (@programmes, @new_programmes);
    }

    # die; # die here when debugging parser

    # write out the xml
    # write out the channels
    $writer->write_channels(\%write_channels);

    #write out the programmes
    foreach (@programmes) {
        foreach my $k (keys %$_) {
            die "undef \$_->{ $k } in $_->{title}->[0]->[0]" if not defined $_->{$k};
        }
        $writer->write_programme($_);
    }
    $writer->end();
}


# Function to get a url.  This also seems like a sensible place to do
# HTML-demoronizing.
#
sub get_url( $ ) {
    my $url = shift;
    t "getting URL: $url";
    for (my $tmp = get_nice($url)) {
        die "cannot get $url" if not defined;
	$numwebgets++; #update stats
	$kbwebgets+= (length $_)/1024;
        tr/\221\222\226/''-/;
        tr/\010//d;
        # There could be other illegal chars
        return $_;
    }
}


# Function to find all the programmes on a channel (at a given date +
# time).
#
# Parameters:
#   XMLTV id of channel
#   Date::Manip object giving date and time
#   prog_to_cat hash (see elsewhere for details)
#   categories hash
#   channels hash
#   filename of test HTML file (does not read from web if ! undef)
#
# Returns: list of programmes
#
sub get_programmes( $$$$$$ ) {
    my $channel_xid = shift;
    my $origtime = shift;
    my $time = $origtime;
    my $tomorrow = DateCalc($time, '+ 1 day');
    my $prog_to_cat = shift;
    my $categories = shift;
    my $channels = shift;
    my $testinput = shift;

    my @p;

    my $data;
    my $url;
    if ( $testinput )
    {
	$url=$testinput;
	local(*INPUT, $/);
	open (INPUT, $testinput) 	|| die "can't open $testinput: $!";
	warn "using $testinput as data source";
	$data = <INPUT>;
    }
    else
    {
	$url = "$BASE_URL{$LANG}$SUMMARY_PATH{$LANG}?move=full";
	$url .= "&channel=" . xmltv_to_be($channel_xid);
	$url .= '&dag=' . UnixDate($time, '%m/%d/%Y');

	# FIXME commonize this
	local $SIG{__WARN__} = sub {
	    warn "$url: $_[0]";
	};
	local $SIG{__DIE__} = sub {
	    die "$url: $_[0]";
	};
	eval {
	    $data = get_url($url);

	    # This check is mostly for the benefit of those using --cache.
	    die 'strange, get_url() not supposed to return undef'
		if not defined $data;
	};
	if ($@) {
	    warn "could not get $url\n";
	    my $from_time = UnixDate($time, '%Q');
	    warn "not fetching any programmes for channel $channel_xid "
		. "at $from_time\n";
	    return ();
	}
    }
    $data =~ tr/[\r\n]//d;
    print STDERR '#' unless $opt_quiet;

    my @results = ($data =~ /<tr>\s*<td[^>]*class='*tvnucontent'*.*?<td[^>]*class=programmabeschrijving[^>]*[^<]*<\/td>\s*<\/tr>/ig);
    if (not @results) {
        if ($data =~ /aucun programme ne correspond|geen programma\'s gevonden /) {
            # Assume that this is because nothing is showing on that
            # channel, not because the site is missing some data.
            #
        }
        else {
            warn "$url: no results found in HTML\n";
        }
        return ();
    }

    # used later in detecting when a program is in tomorrow
    # defined here for performance
    my $time_1400 = DateCalc($origtime, '+ 12 hours');
    my $time_1000 = DateCalc($origtime, '+ 10 hours');

    foreach (@results) {
        t "\nresult: " . $_ . "\n";
        my $PATH=$DETAIL_PATH{$LANG};
        $PATH =~ s/([.?])/\\$1/g;
	my $PAT = "'$PATH([^']+)'";
        m/$PAT/i or die "\n$url: \n matching pattern $PAT cannot find progid in \n$_";
        my $programmeId = $1;


	my $progs = get_programme_summary($channel_xid,  $programmeId,$time, $_);
	if (not $progs) {
	    warn "$url: could not get programme $programmeId on channel $channel_xid\n";
	}
	elsif (not @$progs) {
	    warn "$url: strange, $programmeId on channel $channel_xid seems to be empty";
	}
	else {

	    # attempt to determine when we are in tomorrow
	    #
	    # the problem is that the listings for a 'day' actually range from
	    # 06:00 of "today" to 06:00 of "tomorrow"
	    #
	    # to detect when "Today" has become "tomorrow", we use 2 checking methods:
	    # 1) program starts before midnight and finishes after midnight...
	    #    (this is determined in get_programme_summary: Stoptime is set to tomorrow)
	    # 2) previous program stop time > 14:00; whereas this program's start time < 10:00
	    #    ( this is detected below )
	    #
	    # this relies on programs being returned by get_programme_summary() in date order
	    #
	    t 'determining tomorrow for programmes: ' . d $progs;
	    foreach ( @$progs ) {
		my $lastexistprog=$p[$#p]; #ref to hash
		my $latestprog=$_; # ref to hash
		if ( $time ne $tomorrow ) {
		    if ( ${$latestprog}{'stop'} &&
			 Date_Cmp(${$latestprog}{'stop'},$tomorrow) >= 0 )
		    {
			#if stoptime in tomorrow, assume all future progs are also in tomorrow
			t "Passing into the land of tomorrow A - stop time = ${$latestprog}{'stop'}";
			$time=$tomorrow;
		    }
		    if ( Date_Cmp(${$latestprog}{'start'},${$lastexistprog}{'stop'}) < 0 ) {
			# start/stop time overlap... possibly start time is tommorrow
			# check this -- if start time < today 10:00 and previous stop time > today 14:00
			# then this prog is probably in tomorrow!
			if ( ${$lastexistprog}{'stop'} &&
			     (Date_Cmp(${$latestprog}{'start'}, $time_1000) < 0
			      && Date_Cmp(${$lastexistprog}{'stop'}, $time_1400) > 0 ) )
			{
			    # so we have a program that starts much earlier than the pervious program stops...
			    # Methinks start time has passed into the world of tomorrow!
			    ${$latestprog}{'start'}=utc_offset(DateCalc(${$latestprog}{'start'},' + 1 day') . " UTC", '+0100');
			    die if not defined ${$latestprog}{'start'};
			    t "Passing into the land of tomorrow B - start time = ${$latestprog}{'start'}";
			    if (${$latestprog}{'stop'} &&
				Date_Cmp(${$latestprog}{'start'}, ${$latestprog}{'stop'}) > 0) {
				${$latestprog}{'stop'} = utc_offset(DateCalc(${$latestprog}{'stop'}, '+ 1 day') . " UTC", '+0100');
                                die if not defined ${$latestprog}{'stop'};
				t "Passing into the land of tomorrow C - stop time = ${$latestprog}{'stop'}";
			    }

			    $time=$tomorrow;
			}
		    }
		}
		else
		{
		    # $time eq $tomorrow

		    # we are already in tomorrow, so start and stop
		    # times must be in tomorrow range check this
		    # (necessary for other progs retrieve as a clump
		    # by get_programme_summary()

		    if ( Date_Cmp(${$latestprog}{'start'}, $tomorrow) < 0 )  {
			DateCalc(${$latestprog}{'start'}, '+ 1 day');
		    }
		    if ( ${$latestprog}{'stop'} &&
			 Date_Cmp(${$latestprog}{'stop'}, $tomorrow) < 0 ) {
			DateCalc(${$latestprog}{'stop'}, '+ 1 day');
		    }
		}
		# check for simple start/stop time overlap
		if ( not defined ${$lastexistprog}{'stop'}
		     && ${$latestprog}{'start'} ) {
		    ${$lastexistprog}{'stop'} = ${$latestprog}{'start'}
		}

		if ( ${$lastexistprog}{'stop'} &&
		     Date_Cmp(${$latestprog}{'start'},${$lastexistprog}{'stop'}) < 0 ) {
		    # start time a little before previous stop time
		    # correct previous stop time
		    warn "$url: correcting program overlap stop = ${$lastexistprog}{'stop'} -> ${$latestprog}{'start'};";
		    ${$lastexistprog}{'stop'}=${$latestprog}{'start'};
		}
		push @p, $_;
	    }
	}
    }
    return @p;
}

# Function to parse the HTML and get all the info we need
#
# Parameters:
#   XMLTV id of channel
#   Sanoma id of programme
#   Date::Manip object giving date and time
#   bit of html text- section of table from website with the 2 table rows containing program information
#
# <tr>
# <td class='tvnucontent' rowspan=2> </td>
# <td class='tvnucontent' > STARTTIME </td>
# <td class='tvnucontent' > STOPTIME </td>
# <td class='tvnucontent' > <a href='detail.asp?progid=PROGID' class=tvnu> TITLE </a></td>
# <td class='tvnuthema' > CATEGORY </td>
# <td class='tvnucontent' >   </td>
# </tr>
# <tr>
# <td class=programmabeschrijving> </td>
# <td class=programmabeschrijving> </td>
# <td colspan=2 class=programmabeschrijving> DESCRIPTION </td>
# <td align=right valign='top' class=programmabeschrijving></td>
# </tr>
#
# Returns a listref of programmes: normally with just one element,
# note, _start, _stop will be set to the TIME when the program is set
# the caller is responsable for converting these into date::manip objects
# and deciding whether it is today or tomorrow
#
sub get_programme_summary( $$$$ ) {
#    local $Log::TraceMessages::On = 1;
    my $channel_xid = shift;
    my $channelId = xmltv_to_be($channel_xid);
    my $programmeId = shift;
    my $date = shift;
    my $summaryhtml = shift;


    # @followons are small extra programmes sharing its slot.  Things
    # like news bulletins which come in the middle of a film are also
    # counted as 'after' it, for simplicity.
    #
    my @followons;

    # %p is the main programme we will return.
    my %p;
    $p{channel} = $channel_xid;
    $p{_chanID} = $channelId;
    $p{_progID} = $programmeId;


    # take summaryhtml and extract starttime, stoptime, title, genre, and description
    my @nucontent = ($summaryhtml =~ /<td[^>]+class='*tvnucontent'*.*?<\/td>/ig);

    # attempt to get star rating from title
    if ( $nucontent[3] =~ /<img[^>]*src=[\"\']*([^>]*stars\/)([0-9]{2})\.gif[\"\']*/i )
    {
	# star ratings are 00,10,15,20,25,30,35,40
        # 8 possible ratings: convert to 1->4
        my %rating = ( '00' => 0,
		       '10' => 1,
                       '15' => 1,
                       '20' => 2,
		       '25' => 2,
		       '30' => 3,
		       '35' => 3,
		       '40' => 4 );
        if ( defined $rating{$2} ) {
	    $p{'star-rating'}=[ $rating{$2} ."/4", [ { src => "$BASE_URL{$LANG}$1" . $rating{$2} . "0.gif" } ] ];
        }
	else
	{
	    warn "$programmeId: could not translate rating: $2 / 40"
	}
    }
    foreach ( @nucontent ) {
	$_=clean_html_text($_);
    }
    t "nucontent: " . d \@nucontent;
    if ( scalar(@nucontent) ge 4 ) {

	# check for episodenum in title "Stargate SG-1 (4/13)"
	# or "Stargate SG-1 (4)"
	if ( $nucontent[3] =~ /(.+)\s+(\([0-9]+(\/[0-9]+){0,1}\))/i ) {
	    t "got subtitle in title - $1 -- $2 " . d \$3;
	    $p{'title'} = [ [ $1 ] ];
	    $p{'episode-num'} = [ [ $2 ] ];
	}
	else
	{
	    $p{'title'} = [ [ $nucontent[3] ] ];
	}
        if ( not defined $p{'title'}
	     or not defined  $p{'title'}[0]
	     or not defined $p{'title'}[0][0]
	     or $p{'title'}[0][0] eq '' ) {
            warn "$programmeId: No title defined... skipping programme";
	    return undef
        }

	# strip trailing puctuation from title
	# making sure "E.R." does not become "E.R" in
	# the process!
	$p{'title'}[0][0] =~ s/([^.,:;]{2,})[.,:;]*$/$1/;

        my ($start, $start_tz);
        my ($stop, $stop_tz);
        my $pair;
        $nucontent[1] =~ s/([0-2][0-9])\./$1:/;
        $nucontent[2] =~ s/([0-2][0-9])\./$1:/;
        $nucontent[1] =~ s/24:/00:/;
        $nucontent[2] =~ s/24:/00:/;
        if ( ! $nucontent[1] =~ /[0-2]*[0-9]:[0-5]*[0-9]/ ) {
            warn "$programmeId: No start time defined... skipping programme";
	    return undef
        }
        t "start time $nucontent[1], calling utc_offset()";
	$p{start} = utc_offset(UnixDate($date, '%Y-%m-%d') . " $nucontent[1]", '+0100');
	t "turned into $p{start}";

	if ( $nucontent[2] =~ /[0-2]*[0-9]:[0-5]*[0-9]/ ) {
	     t "stop time $nucontent[2], calling utc_offset()";
	     $p{stop} = utc_offset(UnixDate($date, '%Y-%m-%d') . " $nucontent[2]", '+0100');
	     t "turned into $p{stop}";

	     # Some programmes have thir stop time on the next day.  (This test
	     # may break when the timezones change.)
	     #
	     if (Date_Cmp($p{start}, $p{stop}) > 0) {
		 t 'put stop time a day later';
		 my $n = DateCalc($p{stop}, '+ 1 day');
		 t "DateCalc() gave: $n";
		 $p{stop} = utc_offset("$n +0000", '+0100');
		 t "stop time now $p{stop}";
		 die if not defined $p{stop};
	     }
	 }
    }
    else {
        warn "$programmeId: invalid number of columns for program, skipping: \n" . d \@nucontent;
	return undef;
    }

    $p{url}= [ "$BASE_URL{$LANG}$DETAIL_PATH{$LANG}" . $programmeId ];


    my @thema = ($summaryhtml =~ /<td[^>]+class='*tvnuthema'*.*?<\/td>/ig);
    foreach ( @thema ) {
	$_=clean_html_text($_);
	push @{$p{category}}, [ $_ ] if ( $_ ne '' );
    }
    t "nuthema: " . d \@thema;

    # match for getting detailed info
    my $do_get_details;
    if ( $opt_slow
	 && defined $get_channel_detail{$channel_xid} ) {

	t "channel  selected for details";
	# check match for time range
	my $start_hhmm=$nucontent[1];
	if ( (
	       ( $detailstarttime lt $detailstoptime )
	       # normal time range
	       &&
	       ( $start_hhmm ge $detailstarttime
		 && $start_hhmm lt $detailstoptime )
	       )
	     ||
	     (
	       ( $detailstarttime ge $detailstoptime )
	       # inverted time range: 17:00-02:00 or similar
	       &&
	       ( ( $start_hhmm ge $detailstarttime
		   && $start_hhmm le "24:00" )
		 || ( $start_hhmm ge "00:00"
		      && $start_hhmm lt $detailstoptime )
		 )
	       )
	     ) {
	    t "time range selected for details";
	    # check for genre match
	  MATCHCATEG:
	    foreach my $testgenre ( @detailgenre ) {
		if ( $testgenre eq '*ALL*' ) {
		    # Magic value meaning always yes;
		    #
		    $do_get_details = 1;
		    last MATCHCATEG;
		}
		if ( $p{category} ) {
		    foreach my $categ ( @{@{$p{category}}} ) {
			t "comparing \"${$categ}[0]\" with \"$testgenre\"";
			if ( index(lc ${$categ}[0], lc $testgenre) != -1 ) {
			    $do_get_details=1;
			    last MATCHCATEG;
			}
		    }
		} elsif ( $testgenre eq '*NONE*') {
		    # Magic value meaning empty category;
		    #
		    $do_get_details = 1;
		    last MATCHCATEG;
		}
	    }
	}
    }
    if ( $do_get_details )
    {
	print STDERR '@' unless $opt_quiet;
	get_programme_detailed_info(\%p,$programmeId);
    }
    else
    {
	print STDERR '#' unless $opt_quiet;
    }

    # if no description yet, get it here
    # -- handles the case where getting details is not defined,
    # or if getting details failed
    # or if details had no description for some reason.
    if ( not defined $p{'desc'} ) {
	my $imagedescr;
	my $description;
	my @programmabeschrijving = ($summaryhtml =~ /<td[^>]+class='*programmabeschrijving'*.*?<\/td>/ig);
	parse_programme_details(\%p, \@programmabeschrijving);
    }
    t ' proginfo: ' . d \%p;

    return [ \%p, @followons ];
}

my %unknownimages;
sub parse_programme_details($$) {
    my $p = shift;  # ref to %p defined in get_programme_summary
    my $detailstrings = shift; # ref to array of descr strings

    my $description;
    my $imagedescr;
    foreach ( @{$detailstrings} ) {
	t "details " . d \$_;
	# handle translating images with alt-text
	my @images = ( /<img [^>]*>/ig );
	t "images: " . d \@images;
	foreach ( @images ) {
	    if ( /<img[^>]+src\s*=\s*[\'\"]([^\'\"]+)[\'\"]\s+alt\s*=\s*[\'\"]([^\'\"]+)[\'\"][^>]*>/i ) {
		my $imagepath=$1;
		if ( $imagepath =~ /\/gehoor.gif/i || $imagepath =~ /tt.gif/i) {
		    ${$p}{subtitles} = [ { type => 'teletext' } ];
		}
		elsif ( $imagepath =~ /\/16-9\.gif/i ) {
		    ${$p}{video}{aspect} = "16:9";
		}
		elsif ( $imagepath =~ /\/stereo\.gif/i ) {
		    ${$p}{audio}{stereo} = "stereo";
		}
		elsif ( $imagepath =~ /\/ov\.gif/i ) {
		    #  VO image, use language specific abbrev
		    my $text = $2;
		    if ( $LANG eq $LANG_FR ) {
			$text="VO";
		    } elsif ( $LANG eq $LANG_NL ) {
			$text="OV";
		    }
		    if ( defined $imagedescr ) {
			$imagedescr=$imagedescr . " (" . $text . ")";
		    }
		    else {
			$imagedescr="(" . $text . ")";
		    }
		}
		elsif ( $imagepath =~ /\/dolby\.gif/i ) {
		    ${$p}{audio}{stereo} = "surround";
		}
		elsif ( $imagepath =~ /\/black-white\.gif/i ) {
		    ${$p}{video}{colour} = 0;
		}
		elsif ( $imagepath =~ /\/tele-([0-9]+)\.gif/i ) {
		    # Age rating $1="10,12,16,18"
		    # Cert issuer cannot be determined as it depends on
		    # nationality of channel
		    if ( defined ${$p}{rating} )
		    {
			if ( defined ${$p}{rating}[0][0] ne $1 )
			{
			    warn "${$p}{_progID}: already seen different certificate";
			    push @{${$p}{rating}}, [ $1, undef, [ { src => "$BASE_URL{$LANG}$imagepath" } ] ];
			}
		    }
		    else {
			push @{${$p}{rating}}, [ $1, undef, [ { src => "$BASE_URL{$LANG}$imagepath" } ] ];
		    }
		}
		elsif ( $imagepath =~ /\/premiere\.gif/i ) {
		    ${$p}{new} = "new";
		}
		else {
		    warn "${$p}{_progID}: Unknown info image ($imagepath), putting alt-text into description: \"$2\"" unless $unknownimages{$imagepath}++;
		    # unknown image, use alt-text
		    if ( defined $imagedescr ) {
			$imagedescr=$imagedescr . " (" . $2 . ")";
		    }
		    else {
			$imagedescr="(" . $2 . ")";
		    }
		}
	    }
	}
	$_=clean_html_text($_);
	if ( $_ ne '' ) {
	    if ( defined $description ) {
		$description=$description . ' ' . $_;
	    }
	    else {
		$description=$_;
	    }
	}
    }
    if ( $imagedescr )
    {
	if ( defined $description ) {
	    $description=$description . ' ' . $imagedescr;
	}
	else {
	    $description=$imagedescr;
	}
    }
    if ( defined $description ) {
	# check for epsiode num in description
	# "something (4/13)"
	if ( $description =~ /(.+) (\([0-9]+\/[0-9]+\))/ ) {
	    if ( not defined ${$p}{'episode-num'} ) {
		${$p}{'episode-num'} = [ [ $2 ] ] ;
	    }
	}
	# look for date in description if not already found
	# "something (19xx)" or "something (20xx)"
	# ONLY MATCHES dates in years 19xx  and 20xx
	if ( not defined ${$p}{'date'} ) {
	    if ( $description =~ m/(.+) \(((19|20)[0-9][0-9])\)/ ) {
		t "got year $2 in descr $description";
		${$p}{'date'}=$2;
	    }
	}
	# if film, look for director in description
	if ( ( not defined ${$p}{'director'})
	     && ${$p}{'category'}
	     && ${$p}{'category'}->[0]
	     && ${$p}{'category'}->[0]->[0]
	     && ${$p}{'category'}->[0]->[0] =~ /film/i  )
	{
	    # description is something like
	    # FR: thriller de John Doe.
	    # NL: thriller van John Doe
	    my $whomatchregexp;
	    if ( $LANG eq $LANG_FR ) {
		$whomatchregexp="(de |d'|par )";
	    } elsif ( $LANG eq $LANG_NL ) {
		$whomatchregexp="(van )";
	    }
	    # match "John Doe."
	    # match "John H. Doe."
	    # match "John Howard Doe."
	    # match "Jean-Dominique de La Rochefoucauld" (!)
	    # so name matching regexp is (CHAR. |chars )(repeated) CHARS.
	    if ( $description =~
		 m/.*? $whomatchregexp(([A-Z]\. |[A-Za-z�-�\-]{2,} )+[A-Z�-�a-z\-]{2,})\./ ) {
		t "got director $2 in descr $description";
		push @{${p}->{credits}->{director}}, $2;
	    }
	}

	# check for duration in description (French only so far --
	# can't seem to find equivalent in Flemish listings
	if ( $LANG eq $LANG_FR
	     && $description =~ m/(.*) *Dur�e: +([0-9]+)\'[. ]*(.*)/i )
	{
	    ${$p}{'length'} = $2 * 60;

	    $description = $1;
	    $description .= " " . $3 if ( $3 );
	}

	# check for preogramme is a repeat flag in description
	# match (R. something) at *end* of text
	# will match:
	#   (R. du film de la soir�e)
	#   (R. de samedi)
	#   (R. d'hier)
	#   (R.)
	# will not match:
	#  (R): 5 V.O., 11, 13, 20 V.O., 25
	#  (R. � 17.30 et 24.00)
        # which is Canal Plus' future showing dates and MCM's future
        # showing times
	if ( $description =~ m/(.*) *\(R\.\)[ .]*$/ )
	{
	    # Plain (R.)  with no extra info.
	    # Strip (R.)
	    ${$p}{'previously-shown'} = {};
	    $description = $1;
	}
	elsif ( $description =~ m/\(R\. [^�][^\)]+\)[ .]*$/ )
	{
	    # Repeat with extra info keep info
	    ${$p}{'previously-shown'} = {};
	}


	# Compare one-word descriptions to categories and strip if matched
	$description =~ s/ +$//;
	if ( $description ne ""
	     && $description !~ m/[ ()-]/ ) {
	    $description =~ s/[ -_\.,]*$//;
	    if ( ${$p}{category} ) {
	      MATCHCATEG:
		foreach my $categ ( @{@{${$p}{category}}} ) {
		    t "stripping desc -- duplicate categ \"${$categ}[0]\" -- \"$description\"";
		    if ( index(lc ${$categ}[0], lc $description) != -1 ) {
			# Desc is subsrtring of category... strip desc
			$description = undef;
			last MATCHCATEG;
		    }
		    elsif ( index(lc $description, lc ${$categ}[0]) != -1 ) {
			# category is sub-string of desc... move desc to categoty
			push @{${$p}{category}}, [ $description ];
			$description = undef;
			last MATCHCATEG;
		    }

		}
	    }
	}


	# Short descr: prepend to desc if present
	if ( $description ) {
	    if (${$p}{'desc'}
		&& ${$p}{'desc'}->[0]
		&& ${$p}{'desc'}->[0]->[0] )
	    {
		${$p}{'desc'}->[0]->[0] = $description . ' ' . ${$p}{'desc'}->[0]->[0];
	    }
	    else
	    {
		${$p}{'desc'} = [ [ $description, $LANG ] ];
	    }
	}
    }
}


my $warned_discarding_parts;
my %warn_others;
sub get_programme_detailed_info($$) {
    my $p = shift; # ref to %p defined in get_programme_summary
    my $programmeId = shift;
    my $data;

    # if this func fails, warn, and undef ${$p}{'desc'} -- the
    # get_programme_summary will get description from the summary
    t "getting details for ${$p}{'title'}->[0]->[0] at ${$p}{'start'}";

    my $url = "$BASE_URL{$LANG}$DETAIL_PATH{$LANG}" . $programmeId;

    # FIXME commonize this
    local $SIG{__WARN__} = sub {
	warn "$url: $_[0]";
    };
    local $SIG{__DIE__} = sub {
	die "$url: $_[0]";
    };
    eval {
	$data = get_url($url);

	# This check is mostly for the benefit of those using --cache.
	die 'strange, get_url() not supposed to return undef'
	    if not defined $data;
    };
    if ($@) {
	warn "could not get $url\n";
	my $from_time = UnixDate(${$p}{'start'}, '%Q');
	warn "not fetching detailed info for programme ${$p}{'title'}->[0]->[0] "
            . "for channel ${$p}{'channel'}"
	    . "at $from_time\n";
	return ();
    }
    $data =~ tr/[\r\n]//d;

    # details are in table with rows:
    # <tr>
    # <td class=detailtitels valign='top' nowrap> DETAILTYPE </td>
    # <td class=detailtitels valign='top' nowrap>:</td>
    # <td class=detailcontent valign='top' width='100%'> DETAIL DESCRIPTION </td>
    # </tr>


    my @results = ($data =~/<tr>\s*<td class=\'*detailtitels .*?<\/tr>/ig);

    t "results" . d \@results;
    my @detailstringsarr;

    foreach (@results) {
	my $detailtype;
	my $detailcontent;
	if ( m/<td\s+class=\'*detailtitels[^>]+>([^<]+)<\/td>/i ) {
	    $detailtype=clean_html_text($1);
	}
	if (m/<td\s+class=\'*detailcontent[^>]+>(.*?)<\/td>\s*<\/tr>/i) {
	    $detailcontent=$1;
	}

	t "detailtype = $detailtype, detailcontent = " . d \$detailcontent;

	if ( not defined $detailtype ) {
	    warn "$url: Could not extract details from $_";
	}
	elsif ( not defined $detailcontent ) {
	    warn "$url: Could not extract details from $_";
	}
	elsif ( $detailtype =~ /^(la cha�ne|zender)$/i
		|| $detailtype =~ /^(la date|datum)$/i
		|| $detailtype =~ /^(le d[e�]but|begintijd)$/i
		|| $detailtype =~ /^(la fin|eindtijd)$/i
		|| $detailtype =~ /^(le )*genre$/i ) {
	    # already handled
	}
	elsif ( $detailtype =~ /^(info)$/i ) { # description, including images
	    push @detailstringsarr, $detailcontent ;
	}
	elsif ($detailtype =~ /^(Divers)$/i ){ # dometimes contains (R.) info
	    push @detailstringsarr, $detailcontent ;
	}
	elsif ( $detailtype =~ /^(inhoud|contenu)$/i ) { # detailed description
	    $detailcontent=clean_html_text($detailcontent);
	    if ( ${$p}{'desc'}
		 && ${$p}{'desc'}->[0]
		 && ${$p}{'desc'}->[0]->[0] )
	    {
		${$p}{'desc'}->[0]->[0] = ${$p}{'desc'}->[0]->[0] . ' ' . $detailcontent;
	    }
	    else
	    {
		${$p}{'desc'} = [ [ $detailcontent, $LANG ] ];
	    }
	}
	elsif ($detailtype =~ /^(Acteurs)$/i ) {
	    # actor (part), actor (part) ea
	    $detailcontent=clean_html_text($detailcontent);
	    # remove gastactor:  gastactrice: de stemming van: etc
	    $detailcontent =~ s/[.,:; ]*[^,:]+:/, /i;

	    # remove 'and'
	    $detailcontent =~ s/\s+en\s+/, /i if ($LANG eq $LANG_NL);
	    $detailcontent =~ s/\s+et\s+/, /i if ($LANG eq $LANG_FR);

	    # remove 'e.a.'
	    $detailcontent =~ s/\s+e\.a\.\s*$//i;

	    # add a comma at the end for easy parsing later!
	    $detailcontent =~ s/\s*,*\s*$/,/i;

	    # process each "actor (part)*," block
	    foreach ( $detailcontent =~ /[^,;]+/g ) {
		if ( m/\s*([^\(,]+?)\s+\(([^\),]+)\)[\s,;]*/ ) {
		    t "actor $_ => $1 -- $2";
		    #$1 = actor, $2 = part
		    warn "discarding information about the parts played by each actor\n"
			unless $warned_discarding_parts++;

		    push @{${p}->{credits}->{actor}}, $1;
		} else {
		    s/^\s+//; s/\s+$//;
		    t "actor $_ -- (no part)";;
		    push @{${p}->{credits}->{actor}}, $_;
		}
	    }
	}
	elsif ($detailtype =~ /^(Jaar|Ann�e)$/i ) {
	    $detailcontent=clean_html_text($detailcontent);
	    ${$p}{'date'}=$detailcontent;
	}
	elsif ($detailtype =~ /^(land|pays)$/i ){
	    $detailcontent=clean_html_text($detailcontent);
	    ${$p}{'country'}=[ [ $detailcontent,  $LANG ] ];
	}
 	elsif ($detailtype =~ /^(Aflevering|�pisode)$/i ){
 	    $detailcontent=clean_html_text($detailcontent);
 	    # epsiode number assign to subtitle if not already defined
	    # and to episode num
 	    ${$p}{'episode-num'} = [ [ $detailcontent ] ] ;
 	}
	elsif ($detailtype =~ /^(Afleveringstitel|Titre de l\'�pisode)$/i ){
	    $detailcontent=clean_html_text($detailcontent);
	    # episode name
	    ${$p}{'sub-title'} =  [ [ $detailcontent,  $LANG ] ];
	}
	else {
	    warn "found unknown details tag $detailtype" unless $warn_others{$detailtype}++;
	}
    }
    parse_programme_details($p, \@detailstringsarr) if ( @detailstringsarr );
}


my %warn_windowschars;
sub clean_html_text( $ ) {
    local $_ = shift;
    t "original string" . d \$_;
    # br to newline
    s/<br[^>]*>/\r\n/g;
    # remaining tags to spaces
    s/<\/*[^>]*>/ /g;

    # decode any HTML special chars (&amp; &nbsp;)
    decode_entities($_);
    # note &nbsp; -> \240 -> space


    # get rid of known Windows encoded characters
    # silly windows characters to simple quotes
    tr/\221\222\223\224\226\227/\'\'\"\"\-\-/;
    tr/\010//d;

    # replace invalid windows chars oe ligatures
    s/\234/oe/g;
    s/\214/OE/g;
    # replace windows' "..." character
    s/\205/.../g;

    foreach ( m/[\200-\237]/g ) {
	warn "stripping invalid windows character (" . ord($_) . " - $_) from input: $_" unless $warn_windowschars{ord($_)}++;
    }
    s/[\200-\237]/\?/g;

    # multiple spaces to one space
    s/[\240\s]+/ /g;
    # trim leading and trailing spaces
    s/^\s+//;
    s/\s+$//;
    t "cleaned string" . d \$_;
    return $_;
}

# Function which will locate all the available channels and return a hash
# with channelId as the key and a channel description.
#
sub get_channels() {
    my $data;
    eval {
        $data = get_url("$BASE_URL{$LANG}$SUMMARY_PATH{$LANG}");
        die 'strange, get_url() not supposed to return undef'
          if not defined $data;
    };
    if ($@) {
        die "could not get channels page $BASE_URL{$LANG}$SUMMARY_PATH{$LANG}, aborting\n";
    }
    $data =~ tr/\n\r/\n/ds;
    t 'got channels page: ' . d $data;
    $data =~ s/\n//g;
    $data =~ /<select class=PersoFormSelect size='1' name='channel'[^>]*?>(.*?)<\/select>/
      or die "cannot find channel string in HTML $data";
    my $channel_string = $1;
    t 'got string of channels: ' . d $channel_string;
    $channel_string =~ s/\s+/ /g;
    $channel_string =~ s/<option value=''[^>]*>[^<]*//ig;
    t 'cleanedup string of channels: ' . d $channel_string;
    my @channels = ($channel_string =~ /<option value='[^\']+'[^>]*>[^<]*/ig);
    t 'channels in string: ' . d @channels;
    warn "no channels found in $channel_string" if not @channels;
    my %c;

    foreach (@channels) {
        t 'doing channel string: ' . d $_;
        m/'([^\']+)'/ or die "cannot find sanoma channel id in $_";
        my $channelId = $1;
        t 'got sanoma id: ' . d $channelId;
        m/>(.*)/ or die "cannot find channel description in $_";
        my $channelDesc = $1;
        for ($channelDesc) {
            s/^\s+//; s/\s+$//;
        }
        t 'got description: ' . d $channelDesc;
        my $chanID_to_output = be_to_xmltv($channelId);
        t 'XMLTV id to use: ' . d $chanID_to_output;
        die if not defined $chanID_to_output;
        die if not defined $channelId;

        my @dns = ([ $channelDesc, $LANG ]);
        my $extra_dn = $extra_dn{$LANG}{$chanID_to_output};
        push @dns, [ $extra_dn ] if defined $extra_dn;
	my $ch = { 'display-name' => \@dns,
		    'id' => $chanID_to_output, };

	${$ch}{'icon'} = [ { src => "http://" . $logourl{$LANG}{$chanID_to_output} } ]
	    if ( defined  $logourl{$LANG}{$chanID_to_output} );
        t 'channel object: ' . d $ch;
        $c{$chanID_to_output} = $ch;
        t "added to channels hash under key $chanID_to_output";
    }

    t 'returning hash: ' . d \%c;
    return %c;
}


# Function which will locate all the available dates and return a list
# of Date::Manip objects, one for each day.
#
# (I was tempted to make this a hash (so you could say $available{$d}
# to see if a day exists) but string equality is a bit dirty for
# comparing two Date::Manip objects.  There needs to be a tied hash
# class which can use a specified equality operation.)
#
sub get_available_dates() {
    my @r;
    my $url = "$BASE_URL{$LANG}$SUMMARY_PATH{$LANG}?channel=$DATE_CH{$LANG}";
    my $data;
    eval {
        $data = get_url($url);
        die 'strange, get_url() not supposed to return undef'
          if not defined $data;
    };
    if ($@) {
        die "could not get $url, so cannot find available dates, aborting\n";
    }

    $data =~ s/\n//g;
    $data =~ /<select class=PersoFormSelect size='1' name='dag'[^>]*?>(.*?)<\/select>/
      or die "cannot find searchDate string in HTML $data";
    local $_ = $1;
    s/&nbsp;/ /g;
    s/\s+/ /g;
    s/^\s*//;
    t 'date string: ' . d $_;
    while (length) {
        if (not s!<option value='(\d{1,2})/(\d{1,2})/(\d{4})'[^>]*>[^<]*</option>\s*!!i) {
            warn "remnant junk in date string: $_";
            return @r;
        }

        my $val = "$1/$2/$3";
        my $text = $4;

        my $date_from_val = "$3-$1-$2";
        my $parsed_val = parse_date($date_from_val);

        push @r, $parsed_val;
    }
    return @r;
}

sub be_to_xmltv( $ ) {
    my $n = shift;
    die "undef \$LANG" if not defined $LANG;
    if (not defined $be_to_xmltv{$LANG}{$n}) {
        my $new = (lc $n) . ".$DOMAIN{$LANG}";
        $new =~ s/ //g;
        warn "$DOMAIN{$LANG} Channel id $n not found in channel_ids_${LANG} file, assuming XMLTV id $new\n";
        die "channel id $new already exists" if defined $xmltv_to_be{$LANG}{$new};
        $be_to_xmltv{$LANG}{$n} = $new;
        $xmltv_to_be{$LANG}{$new} = $n;
    }
    return $be_to_xmltv{$LANG}{$n};
}
sub xmltv_to_be( $ ) {
    my $x = shift;
    die "undef \$LANG" if not defined $LANG;
    for ($xmltv_to_be{$LANG}{$x}) {
        die "no $DOMAIN{$LANG} id known for $x" if not defined;
        return $_;
    }
}


# Ask the user which channels to download, and write $config_file.
#
# Uses global %channels hash.
#
sub configure() {
#    local $Log::TraceMessages::On = 1;

    # only lots of page fetches in slow mode!
    if ($opt_slow && not ask_boolean( <<END
Warning: this grabber requires a large number of page fetches from a
human-readable website.

Proceed with configuration?
END
				     , 0)) {
	say("Exiting.\n");
	exit 0;
    }

    XMLTV::Config_file::check_no_overwrite($config_file);

    # FIXME need to make directory
    open(CONF, ">$config_file") or die "cannot write to $config_file: $!";
    print CONF <<END
\#
\# tv_grab_be config file.
\#
\# Format is:
\# language <$LANG_FR|$LANG_NL>
\# detailstarttime <24hr clock time>
\# detailstoptime <24hr clock time>
\# detailgenre <genre regexp>    \#- may be repeated mutiple times
\# channel <xmltv_id> <fr|nl> [dodetail] \#- may be repeated mutiple times
\#
END
;

    for (;;) {
        my $in = ask_choice('Enter the language required (Note: dutch is no longer supported)','French', ('French'));

	die "could not read answer\n" if not defined $in;
	# handle backspace (^H)
	$in =~ s/.\x08//g;
        $in = uc $in;
        if ( $in eq 'FRENCH' ) {
            $LANG=$LANG_FR;
            last;
        }
        elsif ( $in eq 'DUTCH' ) {
            say("Dutch language grabbing is no longer supported, because Teveblad blocks xmltv!.");
            last;
        }
        else {
            say("'$in' is not 'French', try again!.");
        }
    }
    print CONF <<END

\#
\# definition of language mode: $LANG_FR or $LANG_NL
\#
END
;
    print CONF "language $LANG\n";

    if ( $opt_slow )
    {

	say(<<END
Configuring with --slow:

Detailed information grabbing will require 1 web page get for every
program. This is slow, hard work on the web-server, and may upset the
listings provider...

To limit this, there are three selection critera:
Time range (only programs between 16:00  and 00:00)
category (only Series, magazines, films, telefilms)
Channel (only get detailed info for La Une, La Deux, KA2 and VT4)

If all citeria match, then program detail will be obtained (program
will show up as \@ instead of \# on progress bar)

Note: for time range, the early hours of the morning are assumed to be in the same 'day' as the late ours of the night, so 17:00-02:00 is a valid range.

Note: for category, regular expressions are allowed.
END
	    );

	# Time range loop:
	my $starttime;
	my $stoptime;
      TIME_RANGE_LOOP:
	for (;;) {
	  START_LOOP: for (;;) {
	      $starttime = ask("Enter a starting time for grabbing detail (24h format [17:00])");
	      die "could not read answer\n" if not defined $starttime;
	      if ( $starttime eq "" ) {
		  $starttime="17:00";
		  last START_LOOP;
	      }
	      else {
		  $starttime =~ s/^\s+//; $starttime =~ s/\s+$//;
		  if ( $starttime =~ /^([0-9]{2}):([0-9]{2})$/
		       && $1 ge 0 && $1 lt 24
		       && $2 ge 0 && $2 lt 60 ) {
		      last START_LOOP;
		  }
		  say ( "Invalid time format: $starttime");
	      }
	  }

	  STOP_LOOP:
	    for (;;) {
		$stoptime = ask("Enter an ending time for grabbing detail (24h format: [02:00])");
		die "could not read answer\n" if not defined $stoptime;
		if ( $stoptime eq "" ) {
		    $stoptime="02:00";
		    last STOP_LOOP;
		}
		else {
		    $stoptime =~ s/^\s+//; $stoptime =~ s/\s+$//;
		    if ( $stoptime =~ /^([0-9]{2}):([0-9]{2})$/
			 && $1 ge 0 && $1 lt 24
			 && $2 ge 0 && $2 lt 60 ) {
			last STOP_LOOP;
		    }
		    say ( "Invalid time format: $stoptime");
		}
	    }
	    last TIME_RANGE_LOOP;
	}
	print CONF <<END

\#
\# definition of start and stop times for retrieving detailed
\# information for programmes. Times must be in 24 hour clock
\# and may overlap a day bounday (eg 17:00 - 02:00)
\#
END
;
	print CONF "detailstarttime $starttime\n";
	print CONF "detailstoptime $stoptime\n";

	my $example_categs = "";
	if ( $LANG eq $LANG_FR ) {
	    $example_categs=
		"actualit�, court m�trage, divertissement, documentaire, enfant,\n" .
		"film, football, jeu, journal, magazine, musique, sport, s�rie,\n" .
		"talkshow, th��tre, t�l�film.";
	}
	if ( $LANG eq $LANG_NL ) {
	    $example_categs=
		"actua, documentaire, film, kinderprogramma, miniserie, muziek,\n" .
		"nieuws, quiz, serie, soap, spelprogramma, sport, talkshow,\n" .
		"tekenfilm, tekenfilm kind, tvfilm, wielrennen.";
	}
	say(<<END
Enter a list of program categories (genres) These will be sub-string
matched against the Genre column on the daily channel listing page of
TeleMoustique/TeveBlad.

Example categories are:
$example_categs

eg: *ALL* -- match all categories (use with care!)
    *NONE* -- match completely blank categories.
    film -- will match "tvfilm", "telefilm", as well as "film"
END
	    );

	print CONF <<END

\#
\# definition of genres/category substrings to get detailed information for
\# multiple detailgenre lines can be defined
\# eg: *ALL* -- match all categories (use with care!)
\#    *NONE* -- match completely blank categories.
\#    FILM -- will match tvfilm, telefilm, as well as film
\#
END
;
	for (;;) {
	    my $in = ask(<<END
Enter a Genre, or "." to finish:
END
			 );
	    # interpret EOF as '.'
	    last if not defined $in;
	    # handle backspace (^H)
	    $in =~ s/.\x08//g;
	    $in = uc $in;
	    last if $in eq '.';
	    if ( $in eq "" || $in eq '^$' ) {
		say ('Ignoring empty input: Use "*NONE*" to match a blank category');
		$in="";
	    }
	    if ( $in =~ m/[]/ ) {
		say('control characters not allowed -- try again');
		$in="";
	    }
	    print CONF "detailgenre $in\n" if ( $in ne "" );
	}
    }
    else
    {

	# Slow mode not specified... Write dummy config comments, and
	print CONF <<END
\#
\# definition of start and stop times for retrieving detailed
\# information for programmes. Times must be in 24 hour clock
\# and my overlap a day bounday (eg 17:00 - 02:00)
\#
\# Configured without --slow flag; detailstarttime and detailstoptime not specified

\#
\# definition of genres/category substrings to get detailed information for
\# multiple detailgenre lines can be defined
\# eg: *ALL* -- match all categories (use with care!)
\#    *NONE* -- match completely blank categories.
\#    FILM -- will match tvfilm, telefilm, as well as film
\#
\# Configured without --slow flag; detailgenre list not specified
END
;

	# print out a message
	say( <<END
The Default configuration for this grabber is to only grab the summary
information for programmes. (channel/start/title/brief description)

If you want detailed information (episode name, detailed description,
actors) then you must re-configure and run this grabber with the
--slow option
END
	     );
    }

    # FIXME turn into progress bar.
    print STDERR "finding channels from $DOMAIN{$LANG} :\t";
    my %channels = get_channels();
    print STDERR "got " . (scalar keys %channels) . ", done.\n";

    my %chose_ch;
    t 'channels: ' . d \%channels;

    print CONF <<END

\#
\# definition of channels to grab, and whether to grab detailed info for the channel
\# multiple channel lines can be defined as:
\#   channel xmltv.channel.id language dodetail
\# or (if no detail required)
\#   channel xmltv.channel.id
\#
\# where language is fr or nl (for future use: currently ignored)
\#
END
;
    # nielm 25/4/2007 convert to ask_many_boolean
    my @questions;
    my @chan_ids=keys %channels;
    t 'channel ids: ' . d \@chan_ids;
    @chan_ids=sort {$channels{$a}->{'display-name'}->[0]->[0] cmp $channels{$b}->{'display-name'}->[0]->[0] } keys %channels;
    t 'sorted channel ids: ' . d \@chan_ids;

    foreach my $k (@chan_ids) {
        push @questions, "Add channel ".$channels{$k}->{'display-name'}->[0]->[0]."? ";
    }
    t 'questions ' . d \@questions;
    my @answers = ask_many_boolean(0, @questions);
    t 'answers ' . d \@answers;

    for (my $i=0; $i < $#answers; $i++) {
        if ($answers[$i]) {
	    my $xmltv_id=$chan_ids[$i];
	    t 'selected chanel '.$i . ' id ' .  $xmltv_id . d \$channels{$xmltv_id};

	    if ( defined $ch_warn{$LANG}{$xmltv_id} ) {
		if (ask_boolean( <<END
Warning for $channels{$xmltv_id}->{'display-name'}->[0]->[0] :
$ch_warn{$LANG}{$xmltv_id}

Confirm add channel: $channels{$xmltv_id}->{'display-name'}->[0]->[0] ?
END
				 ,0)){

		    if ($opt_slow
			&& ask_boolean(
				       'Get detailed info for channel '
				       .$channels{$xmltv_id}->{'display-name'}->[0]->[0]
				       .'?'
				       ,0)) {
			print CONF "channel $xmltv_id $LANG dodetail\n";
		    }
		    else {
			print CONF "channel $xmltv_id $LANG\n";
		    }
		}
	    }
	    else
	    {
		$chose_ch{$xmltv_id}++;
		if ($opt_slow
		    && ask_boolean(
				   'Get detailed info for channel '.
				   $channels{$xmltv_id}->{'display-name'}->[0]->[0]
				   .'?'
				   ,1)) {
		    print CONF "channel $xmltv_id $LANG dodetail\n";
		}
		else {
		    print CONF "channel $xmltv_id $LANG\n";
		}
	    }
        }
    }

    close CONF or warn "cannot close $config_file: $!";
    say("Finished configuration.");
    exit();
}