#!/usr/local/bin/perl -w

=pod

=head1 NAME

tv_imdb - Augment XMLTV listings files with imdb.com data.

=head1 SYNOPSIS

tv_imdb --imdbdir <dir> [--help] [--quiet] [--download]
        [--prepStage (1-9,all)]

tv_imdb --imdbdir <dir> [--help] [--quiet]
        [--with-keywords] [--with-plot]
        [--movies-only] [--actors NUMBER]
        [--stats] [--debug]
        [--output FILE] [FILE...]

tv_imdb --imdbdir <dir>
        --validate-title 'movie title'
        --validate-year 2004
        [--with-keywords] [--with-plot]
        [--debug]

=head1 DESCRIPTION

Very similar to tv_cat in semantics (see tv_cat),
except whenever a programme appears with a "date" entry the
title and date are used to look up extra data by using the
XMLTV::IMDB package.

B<--output FILE> write to FILE rather than standard output.

B<--with-keywords> include IMDb keywords in the output file.

B<--with-plot> include IMDb plot summary in the output file.

B<--actors NUMBER> number of actors from IMDb to add (default=3).

B<--quiet> disable all status messages (that normally appear on stderr).

B<--download> try to download data files if they are missing (in --prepStage).

B<--stats> output grab stats (stats output disabled in --quiet mode).

B<--debug> output info from movie matching.

B<--movies-only> only augment programs that look like movie listings (4 digit
E<39>dateE<39> field).

All programs are checked against imdb.com data (unless --movies-only is used).

For the purposes of tv_imdb, an "exact" match is defined as a case
insensitive match against imdb.com data (which may or may not include the
transformation of E<39>&E<39> to E<39>andE<39> and vice-versa).

If the program includes a 4 digit E<39>dateE<39> field the following
matches are attempted, with the first successful match being used:

B<1.> an "exact" title/year match against movie titles is done

B<2.> an "exact" title match against tv series (and tv mini series)

B<3.> an "exact" title match against movie titles with production dates
within 2 years of the E<39>dateE<39> value.

Unless --movies-only is used, if the program does not include a 4 digit
E<39>dateE<39> field the following match is attempted:

B<1.> an "exact" title match against tv series (and tv mini series)

When a match is found in the imdb.com data the following is applied:

B<1.> the E<39>titleE<39> field is set to match exactly the title from the
imdb.com data. This includes modification of the case to match and any
transformations mentioned above.

B<2.> if the match is a movie, the E<39>dateE<39> field is set to the imdb.com
4 digit year of production.

B<3.> the type of match found (Movie, TV Movie, Video Movie, TV Series,
or TV Mini Series) is placed in the E<39>categoryE<39> field.

B<4.> the url to the www.imdb.com page is added

B<5.> the director is added if the match was a movie or if only one director
is listed in the imdb.com data (because some tv series have > 30 directors)

B<6.> the top 3 billed actors are added (use --actors NUMBER to adjust).

B<7.> genres added to E<39>categoryE<39> field (current list of genres are
Short, Drama, Comedy, Documentary, Animation, Adult, Action, Family, Romance,
Crime, Thriller, Musical, Adventure, Western, Horror, Sci-Fi, Fantasy, Mystery,
War, Film-Noir, Music).

B<8.> imdb user ratings added to E<39>star-ratingE<39> field.

B<9.> imdb keywords added to E<39>keywordE<39> fields (if --with-keywords used).

B<10.> imdb plot summary is added (if --with-plot used).
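
As an illustration of how the options above combine, the following invocation
is a sketch only: the directory and file names are placeholders, and it
assumes the database in /var/lib/imdb has already been built with --prepStage.
It augments only movie listings, adds up to 5 actors and the plot summary,
and writes the result to a file rather than standard output:

 tv_imdb --imdbdir /var/lib/imdb --movies-only --actors 5 --with-plot --output augmented.xml listings.xml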

=head1 HOWTO

In order to use tv_imdb, you need to:

B<1.> choose a directory location to use for the tv_imdb database (youE<39>ll
need about 1 GB of free space),

B<2a.> run E<39>tv_imdb --imdbdir <dir> --prepStage all --downloadE<39>
to download the list files from imdb.com. Or,

B<2b.> If you have a slow network connection you may prefer to omit
the '--download' flag and be prompted for what you need to download by
hand. See <http://www.imdb.com/interfaces> for the download sites.
Then once you have the files rerun without '--download'.

Note: '--prepStage' uses a fair amount of memory, but you can run each
prepStage separately by running --prepStage with each of the stages
(see --help for details).

B<3.> Once you have the database loaded try
E<39>cat tv.xml | tv_imdb --imdbdir <dir> > tv1.xmlE<39>.

Feel free to report any problems with these steps to xmltv-devel@lists.sf.net.

=head1 TESTING

The --validate-title and --validate-year flags can be used to validate the
information in the tv_imdb database. For example:

 tv_imdb --imdbdir . --validate-title 'Army of Darkness' --validate-year 1994

=head1 BUGS

The '--prepStage' needs a lot of memory to run at a reasonable speed,
over 250 megabytes with the current imdb data files. For there to be
250 megabytes free for tv_imdb, the system will need at least 512 megabytes
of RAM. Running with less can take hours (or days!), although fortunately
this stage needs to be run only once after downloading the data files.

Could use a --configure step just like the grabbers so you do not have
to specify the --imdbdir on the command line every time. Also this could
step you through the prep stages with more description of what is being
done and what is required. Configure could also control the number of
actors to add (since some movies have an awful lot), currently we are
adding the top 3.

How and what to look up needs to be option driven.

Needs some more controls for fine tuning "close" matches. For
instance, currently it looks like the North America grabber only has
date entries for movies, but the imdb.com data contains made for video
movies as well as real movies, so it is possible for the
wrong data to be inserted. In this case we may want to say "ignore tv
series" and "ignore tv mini series". Along with this, weE<39>d want
to define what a "close" match is. For instance, is a movie with the
same title and a date out by 1 year or 2 years considered a match?
(Currently weE<39>re using 2.)

Nice to haves include: verification/addition of programme MPAA/VCHIP ratings,
addition of imdb.com user ratings (by votes) to programmes. Potentially we
could expand to include "country of origin", "description", "writer" and
"producer" credits, maybe even "commentator".

Heh, if the XMLTV.dtd supported it, we could even include urls to head
shots of the actors :)

=head1 SEE ALSO

L<xmltv(5)>

=head1 AUTHOR

Jerry Veldhuis, jerry@matilda.com

=cut

use strict;
use XMLTV::Version '$Id: tv_imdb,v 1.36 2017/05/23 12:51:33 bilbo_uk Exp $ ';
use Data::Dumper;
use Getopt::Long;
use IO::File;    # needed for the --output file handle

use XMLTV;
use XMLTV::Data::Recursive::Encode;
use XMLTV::Usage <<END
$0: augment listings with data from imdb.com
$0 --imdbdir <dir> [--help] [--quiet] [--download] [--prepStage (1-9,all)]
$0 --imdbdir <dir> [--help] [--quiet] [--download] [--with-keywords] [--with-plot] [--movies-only] [--actors NUMBER] [--stats] [--debug] [--output FILE] [FILE...]

END
;
use XMLTV::IMDB;

my ($opt_help,
    $opt_output,
    $opt_prepStage,
    $opt_imdbDir,
    $opt_quiet,
    $opt_download,
    $opt_stats,
    $opt_debug,
    $opt_movies_only,
    $opt_with_keywords,
    $opt_with_plot,
    $opt_num_actors,
    $opt_validate_title,
    $opt_validate_year,
    );

GetOptions('help'             => \$opt_help,
           'output=s'         => \$opt_output,
           'prepStage=s'      => \$opt_prepStage,
           'imdbdir=s'        => \$opt_imdbDir,
           'with-keywords'    => \$opt_with_keywords,
           'with-plot'        => \$opt_with_plot,
           'movies-only'      => \$opt_movies_only,
           'actors=s'         => \$opt_num_actors,
           'quiet'            => \$opt_quiet,
           'download'         => \$opt_download,
           'stats'            => \$opt_stats,
           'debug+'           => \$opt_debug,
           'validate-title=s' => \$opt_validate_title,
           'validate-year=s'  => \$opt_validate_year,
    ) or usage(0);

usage(1) if $opt_help;
usage(1) if ( not defined($opt_imdbDir) );

$opt_with_keywords=0 if ( !defined($opt_with_keywords) );
$opt_with_plot=0     if ( !defined($opt_with_plot) );
$opt_num_actors=3    if ( !defined($opt_num_actors) );
$opt_movies_only=0   if ( !defined($opt_movies_only) );
$opt_debug=0         if ( !defined($opt_debug) );

$opt_quiet=(defined($opt_quiet));
if ( !defined($opt_stats) ) {
    $opt_stats=!$opt_quiet;
}
else {
    $opt_stats=(defined($opt_stats));
}
$opt_debug=0 if $opt_quiet;

if ( defined($opt_prepStage) ) {
    print STDERR <<END
Building indices.  Be warned, this needs a lot of memory for the final stage
(working set about 250 megabytes).

END
      if ( !
$opt_quiet ) ;

    my %options =
      ('imdbDir'              => $opt_imdbDir,
       'verbose'              => !$opt_quiet,
       'showProgressBar'      => !$opt_quiet,
       'stageToRun'           => $opt_prepStage,
       'downloadMissingFiles' => $opt_download,
      );

    if ( $opt_prepStage eq "all" ) {
        for (my $stage=1 ; $stage <= 9 ; $stage++ ) {
            my $n=new XMLTV::IMDB::Crunch(%options);
            if ( !$n ) {
                exit(1);
            }
            my $ret=$n->crunchStage($stage);
            if ( $ret != 0 ) {
                exit($ret);
            }
        }
        print STDERR "database load complete, let the games begin !\n" if ( !$opt_quiet);
        exit(0);
    }
    else {
        my $n=new XMLTV::IMDB::Crunch(%options);
        if ( !$n ) {
            exit(1);
        }
        my $ret=$n->crunchStage(int($opt_prepStage));
        if ( $ret == 0 && int($opt_prepStage) == 9 ) {
            print STDERR "database load complete, let the games begin !\n" if ( !$opt_quiet);
        }
        exit($ret);
    }
}
elsif ( $opt_download ) {
    my %options =
      ('imdbDir'              => $opt_imdbDir,
       'verbose'              => !$opt_quiet,
       'showProgressBar'      => !$opt_quiet,
       'stageToRun'           => 'all',
       'downloadMissingFiles' => $opt_download,
      );

    my $n=new XMLTV::IMDB::Crunch(%options);
    if ( !$n ) {
        exit(1);
    }
    exit(0);
}

my $imdb=new XMLTV::IMDB('imdbDir'         => $opt_imdbDir,
                         'verbose'         => $opt_debug,
                         'cacheLookups'    => 1,
                         'cacheLookupSize' => 1000,
                         'updateKeywords'  => $opt_with_keywords,
                         'updatePlot'      => $opt_with_plot,
                         'numActors'       => $opt_num_actors,
                         );

#$imdb->{verbose}++;

if ( my $errline=$imdb->sanityCheckDatabase() ) {
    print STDERR "$errline";
    print STDERR "tv_imdb: you need to use --prepStage to rebuild\n";
    exit(1);
}

if ( !$imdb->openMovieIndex() ) {
    print STDERR "tv_imdb: open database failed\n";
    exit(1);
}

if ( defined($opt_validate_title) != defined($opt_validate_year) ) {
    print STDERR "tv_imdb: both --validate-title and --validate-year must be used together\n";
    exit(1);
}

if ( defined($opt_validate_title) && defined($opt_validate_year) ) {
    my $prog;

    $prog->{title}->[0]->[0]=$opt_validate_title;
    $prog->{date}=$opt_validate_year;
    $imdb->{updateTitles}=0;

    #print Dumper($prog);
    my $n=$imdb->augmentProgram($prog, $opt_movies_only);
    if ( $n ) {
        $Data::Dumper::Sortkeys = 1; # ensure consistent order of dumped hash
        #my $encoding;
        #my $w = new XMLTV::Writer((), encoding => $encoding);
        #$w->start(shift);
        #$w->write_programme($n);
        print Dumper($n);
        #$w->end();
    }
    $imdb->closeMovieIndex();
    exit(0);
}

# send output to a file if --output was given (default is standard output)
my %w_args = ();
if (defined $opt_output) {
    my $fh = new IO::File ">$opt_output";
    die "cannot write to $opt_output\n" if not $fh;
    %w_args = (OUTPUT => $fh);
}

my $numberOfSeenChannels=0;

my $w;
my $encoding;   # store encoding of input file

sub encoding_cb( $ ) {
    die if defined $w;
    $encoding = shift;    # callback returns the file's encoding
    $w = new XMLTV::Writer(%w_args, encoding => $encoding);
}

sub credits_cb( $ ) {
    $w->start(shift);
}

my %seen_ch;
sub channel_cb( $ ) {
    my $c = shift;
    my $id = $c->{id};
    $Data::Dumper::Sortkeys = 1; # ensure consistent order of dumped hash
    if (not defined $seen_ch{$id}) {
        $w->write_channel($c);
        $seen_ch{$id} = $c;
        $numberOfSeenChannels++;
    }
    elsif (Dumper($seen_ch{$id}) eq Dumper($c)) {
        # They're identical, okay.
    }
    else {
        warn "channel $id may differ between two files, "
          . "picking one arbitrarily\n";
    }
}

sub programme_cb( $ ) {
    my $prog=shift;

    # The database made by IMDB.pm is read as iso-8859-1. The xml file may be different (e.g. utf-8).
    # IMDB::augmentProgram does not re-encode the data it adds, so the output file has invalid characters (bug #440).

    my $orig_prog = $prog;
    if (lc($encoding) ne 'iso-8859-1') {
        # decode the incoming programme
        $prog = XMLTV::Data::Recursive::Encode->decode($encoding, $prog);
    }

    # augmentProgram will now add imdb data as iso-8859-1
    my $nprog=$imdb->augmentProgram($prog, $opt_movies_only);
    if ( $nprog ) {
        if (lc($encoding) ne 'iso-8859-1') {
            # re-code the modified programme back to original encoding
            $nprog = XMLTV::Data::Recursive::Encode->encode($encoding, $nprog);
        }
        $prog=$nprog;
    }
    else {
        $prog = $orig_prog;
    }

    # we only add movie information to programmes
    # that have a 'date' element defined (since we need
    # a year to work with when verifying we got the correct
    # hit in the imdb data)
    $w->write_programme($prog);
}

@ARGV = ('-') if not @ARGV;

XMLTV::parsefiles_callback(\&encoding_cb, \&credits_cb,
                           \&channel_cb, \&programme_cb,
                           @ARGV);
# we only get a Writer if the encoding callback gets called
if ( $w ) {
    $w->end();
}

if ( $opt_stats ) {
    print STDERR $imdb->getStatsLines($numberOfSeenChannels);
}
$imdb->closeMovieIndex();
exit(0);