root/lang/perl/WWW-Mixi-Scraper/trunk/lib/WWW/Mixi/Scraper.pm @ 18974

Revision 18974, 6.0 kB (checked in by charsbar, 6 years ago)

WWW-Mixi-Scraper: fixed pod and updated manifest and 0.19 -> CPAN

package WWW::Mixi::Scraper;

use strict;
use warnings;

our $VERSION = '0.19';

use String::CamelCase qw( decamelize );
use Module::Pluggable::Fast
  name   => 'plugins',
  search => [qw( WWW::Mixi::Scraper::Plugin )];

use WWW::Mixi::Scraper::Mech;
use WWW::Mixi::Scraper::Utils qw( _uri );

sub new {
  my ($class, %options) = @_;

  my $mode = delete $options{mode};
     $mode = ( $mode && uc $mode eq 'TEXT' ) ? 'TEXT' : 'HTML';

  my $mech = WWW::Mixi::Scraper::Mech->new(%options);

  my $self = bless { mech => $mech }, $class;

  # install an accessor for each plugin, named after the
  # decamelized basename of the plugin class
  no strict   'refs';
  no warnings 'redefine';
  foreach my $plugin ( $class->plugins( mech => $mech, mode => $mode ) ) {
    my ($name) = decamelize(ref $plugin) =~ /(\w+)$/;
    $self->{$name} = $plugin;
    *{"$class\::$name"} = sub { shift->{$name} };
  }

  $self;
}

sub parse {
  my ($self, $uri, %options) = @_;

  $uri = _uri($uri) unless ref $uri eq 'URI';

  # map the path (e.g. /view_diary.pl) to a plugin accessor name
  my $path = $uri->path;
  $path =~ s|^/||;
  $path =~ s|\.pl$||;

  unless ( $self->can($path) ) {
    warn "You don't have a proper plugin to handle $path";
    return;
  }

  # explicit options take precedence over query parameters
  foreach my $key ( $uri->query_param ) {
    next if exists $options{$key};
    $options{$key} = $uri->query_param($key);
  }
  $self->$path->parse( %options );
}

1;

__END__

=head1 NAME

WWW::Mixi::Scraper - yet another mixi scraper

=head1 SYNOPSIS

    use WWW::Mixi::Scraper;
    my $mixi = WWW::Mixi::Scraper->new(
      email => 'foo@bar.com', password => 'password',
      mode  => 'TEXT'
    );

    my @list = $mixi->parse('http://mixi.jp/new_friend_diary.pl');
    my @list = $mixi->new_friend_diary->parse;

    my @list = $mixi->parse('http://mixi.jp/new_bbs.pl?page=2');
    my @list = $mixi->new_bbs->parse( page => 2 );

    my $diary = $mixi->parse('/view_diary.pl?id=0&owner_id=0');
    my $diary = $mixi->view_diary->parse( id => 0, owner_id => 0 );

    my @comments = @{ $diary->{comments} };

    # for testing
    my $html = read_file('/some/where/mixi.html');
    my $diary = $mixi->parse('/view_diary.pl', html => $html );
    my $diary = $mixi->view_diary->parse( html => $html );

=head1 DESCRIPTION

This is yet another scraper for 'mixi' (the largest SNS in Japan), powered by Web::Scraper. Though its API is different from and incompatible with the earlier WWW::Mixi, I'm loosely trying to keep the corresponding return values similar as of this writing (this may change in the future).

WWW::Mixi::Scraper is also pluggable, so if you want to scrape something it can't handle yet, add your own WWW::Mixi::Scraper::Plugin::<PLfileBasenameInCamel>, and it will work for you.
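
As a rough sketch of the convention above: the main module finds plugins under the WWW::Mixi::Scraper::Plugin namespace, instantiates them with C<mech> and C<mode>, and calls C<parse> on them, so the bare minimum a plugin must provide looks something like this (C<SomeList> and its contents are hypothetical; the real distribution ships a plugin base class with scraping helpers, which this sketch ignores):

    package WWW::Mixi::Scraper::Plugin::SomeList;  # handles some_list.pl
    use strict;
    use warnings;

    # Module::Pluggable::Fast instantiates the plugin with
    # ( mech => $mech, mode => $mode ); the real base class
    # may provide this constructor for you.
    sub new {
      my ($class, %args) = @_;
      bless { %args }, $class;
    }

    sub parse {
      my ($self, %options) = @_;
      # fetch via $self->{mech}, scrape, and return a list
      # (or a hash reference) of scraped items
    }

    1;

Once installed, it becomes reachable as C<< $mixi->some_list->parse >> or C<< $mixi->parse('/some_list.pl') >>, via the decamelized accessor that C<new> generates.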

=head1 DIFFERENCES BETWEEN THE TWO

WWW::Mixi has a much longer history and is full-stack. The data it returns tends to be more complete, fine-tuned, and raw in many ways (including encoding). However, it tends to break on minor HTML changes, as it relies heavily on regexes, and as of this writing (July 2008), it has been broken for months due to a major cosmetic change of mixi in October 2007.

In contrast, WWW::Mixi::Scraper will hopefully survive minor HTML changes, as it relies on XPath/CSS selectors. It basically uses decoded perl strings, not octets. It's smaller, and pluggable. However, its data is more or less pre-processed and tends to lose some aspects such as proper line breaks, its output may be more easily polluted with garbage, and its scraping rules may be harder to understand and maintain.

Anyway, though a bit limited, ::Scraper is the only practical option right now.

=head1 IF YOU WANT MORE

If you want more features, please send me a patch, or, preferably, commit one to the L<coderepos repository|http://coderepos.org/share/>. Just telling me what you want to scrape is also fine, but it may take longer to implement, especially when the page is new or less popular and I don't have enough samples.

=head1 ON Plagger::Plugin::CustomFeed::MixiScraper

Usually you want to use this with L<Plagger>, but unfortunately, the current CPAN version of Plagger (0.7.17) doesn't have the above plugin. You can always get the latest version of the plugin from L<Plagger's official repository|http://svn.bulknews.net/repos/plagger/trunk/plagger/lib/Plagger/Plugin/CustomFeed/MixiScraper.pm>. See L<Plagger's official site|http://plagger.org/> for instructions on updating your Plagger and installing extra plugins.

=head1 METHODS

=head2 new

creates an object. You can pass an optional hash. Important keys are:

=over 4

=item email, password

the ones you use to log in.

=item mode

WWW::Mixi::Scraper changed its policy as of 0.08, and by default it now returns raw HTML for some of the longer texts such as a user's profile or diary body. However, this may cause various kinds of problems. If you don't like HTML output, set this 'mode' option to 'TEXT', and it will return pre-processed texts as before.

=item cookie_jar

would be passed to WWW::Mechanize. If your cookie_jar has valid cookies for mixi, you don't need to write your email/password in your scripts.

=back

Other options are passed through to Mech, too.
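
For example, reusing a saved cookie jar instead of credentials might look like this (the cookie file path is hypothetical; C<cookie_jar> is handed through to WWW::Mechanize, which accepts an HTTP::Cookies object):

    use WWW::Mixi::Scraper;
    use HTTP::Cookies;

    my $jar = HTTP::Cookies->new(
      file     => "$ENV{HOME}/.mixi_cookies.txt",  # hypothetical path
      autosave => 1,
    );

    my $mixi = WWW::Mixi::Scraper->new(
      cookie_jar => $jar,     # no email/password needed with valid cookies
      mode       => 'TEXT',   # pre-processed text instead of raw HTML
    );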

=head2 parse

takes a URI and returns scraped data, which is usually an array, sometimes a hash reference, or possibly something else, depending on the plugin that does the actual scraping. You can pass an optional hash, whose values override the query parameters of the URI. An exception is 'html', which doesn't go into the URI but provides a raw HTML string to the scraper (mainly for testing).
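
To illustrate the override rule above (the URL is the same one used in the SYNOPSIS):

    # the explicit option wins over the query parameter:
    # the scraper sees page => 3, not page => 2
    my @list = $mixi->parse('http://mixi.jp/new_bbs.pl?page=2', page => 3);

    # and 'html' never touches the URI; it feeds a saved page
    # to the scraper instead of fetching from the network
    my $diary = $mixi->parse('/view_diary.pl', html => $html);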

=head1 TO DO

More scraper plugins, various levels of caching, password obfuscation, getters for minor information such as pagers, counters, and image/thumbnail sources, and maybe more docs?

Anyway, as this is a 'scraper', I don't include 'post'-related methods here. If you insist, use the WWW::Mechanize object hidden in the stash, or WWW::Mixi.

=head1 SEE ALSO

L<WWW::Mixi>, L<Web::Scraper>, L<WWW::Mechanize>, L<Plagger>

=head1 AUTHOR

Kenichi Ishigaki, E<lt>ishigaki at cpan.orgE<gt>

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2007 by Kenichi Ishigaki.

This program is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=cut