Laravel + Guzzle + Crawler: Scraping Web Content

17 July 2018 · Laravel

① If the target is an API that returns JSON, fetch the pages like this:

use GuzzleHttp\Client;

$client = new Client(['base_uri' => 'https://www.usnews.com']);
$total_pages = 1; // placeholder; replaced by the real count after the first request
for ($i = 1; $i <= $total_pages; $i++) {
    $result = $client->request('GET', "/best-colleges/rankings/$classtype?_page=$i&format=json", [
        'headers' => [
            "Proxy-Connection" => "keep-alive",
            "Pragma" => "no-cache",
            "Cache-Control" => "no-cache",
            "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
            "Accept" => "image/x-xbitmap,image/jpeg,application/x-shockwave-flash,application/vnd.ms-excel,application/vnd.ms-powerpoint,application/msword,*/*",
            "DNT" => "1",
            "Accept-Encoding" => "gzip, deflate, sdch",
            "Accept-Language" => "zh-CN,zh;q=0.8,en-US;q=0.7,en;q=0.6",
        ]
    ]);
    $school = json_decode($result->getBody());
    if ($total_pages == 1) {
        $total_pages = $school->data->total_pages; // set the real page count after the first response
    }
    sleep(3); // throttle between requests
}
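The bare sleep(3) throttle works, but a single transient network failure aborts the whole loop. One way to harden it is a small retry-with-backoff wrapper; the sketch below uses illustrative names (fetchWithRetry is not part of the original code) and plain PHP so it is independent of Guzzle:

```php
<?php
// Retry a callable up to $maxAttempts times, sleeping longer after
// each failure (simple linear backoff). Illustrative sketch only.
function fetchWithRetry(callable $request, int $maxAttempts = 3, int $baseDelay = 3)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            return $request();
        } catch (Exception $e) {
            if ($attempt === $maxAttempts) {
                throw $e; // give up after the last attempt
            }
            sleep($baseDelay * $attempt); // back off before retrying
        }
    }
}
```

Any of the request calls above could then be routed through it, e.g. `$result = fetchWithRetry(function () use ($client, $i, $classtype) { return $client->request('GET', "/best-colleges/rankings/$classtype?_page=$i&format=json"); });`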

② If the target is an HTML page, fetch it and parse the DOM nodes with Symfony's Crawler:

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

$client = new Client(['base_uri' => 'https://www.usnews.com']);
$result = $client->request('GET', "/best-colleges/{$college->url_name}-{$college->primary_key}", [
    'headers' => [
        "Proxy-Connection" => "keep-alive",
        "Pragma" => "no-cache",
        "Cache-Control" => "no-cache",
        "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
        "Accept" => "image/x-xbitmap,image/jpeg,application/x-shockwave-flash,application/vnd.ms-excel,application/vnd.ms-powerpoint,application/msword,*/*",
        "DNT" => "1",
        "Accept-Encoding" => "gzip, deflate, sdch",
        "Accept-Language" => "zh-CN,zh;q=0.8,en-US;q=0.7,en;q=0.6",
    ]
]);
$content = $result->getBody()->getContents();
$res = $this->usnewsContentArticle($content);

// Extract article nodes from the page. Tip: locate the element with the
// browser's inspector, then right-click and "Copy selector" to grab the
// CSS selector — it works well here.
private function usnewsContentArticle($content)
{
    $crawler = new Crawler($content);
    $articles = $crawler->filter('#next-stories > div.flex-media')->each(function (Crawler $node, $i) {
        $article['title'] = $node->filter('div:nth-child(2) > h3:nth-child(1) > a:nth-child(1)')->count() ? $node->filter('div:nth-child(2) > h3:nth-child(1) > a:nth-child(1)')->text() : '';
        $article['href'] = $node->filter('div:nth-child(2) > h3:nth-child(1) > a:nth-child(1)')->count() ? $node->filter('div:nth-child(2) > h3:nth-child(1) > a:nth-child(1)')->attr('href') : '';
        $article['auto'] = $node->filter('div:nth-child(2) > div:nth-child(2) > p:nth-child(1) > span:nth-child(1) > a:nth-child(1)')->count() ? $node->filter('div:nth-child(2) > div:nth-child(2) > p:nth-child(1) > span:nth-child(1) > a:nth-child(1)')->text() : '';
        $article['time'] = $node->filter('div:nth-child(2) > div:nth-child(2) > p:nth-child(1)')->count() ? $node->filter('div:nth-child(2) > div:nth-child(2) > p:nth-child(1)')->text() : '';
        $article['desc'] = $node->filter('div:nth-child(2) > div:nth-child(2) > p:nth-child(2)')->count() ? $node->filter('div:nth-child(2) > div:nth-child(2) > p:nth-child(2)')->text() : '';
        $article['img'] = $node->filter('div:nth-child(1) > a > picture > source:nth-child(1)')->count() ? $node->filter('div:nth-child(1) > a > picture > source:nth-child(1)')->attr('srcset') : '';
        $article['content'] = $this->usnewsArticleContent($article['href']);
        sleep(3); // throttle before fetching the next article body
        return $article;
    });
    return $articles;
}
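The repeated filter(...)->count() ? ... : '' pattern above can be collapsed into a small guard helper. The sketch below shows the same idea with PHP's built-in DOMXPath so it runs without extra packages; the helper name, markup, and selectors are illustrative, not the real usnews.com structure:

```php
<?php
// Minimal sketch: read a node's text with a missing-node guard,
// mirroring the count()-check pattern, via PHP's DOM extension.
function xpathText(DOMXPath $xp, string $query, ?DOMNode $ctx = null): string
{
    $nodes = $xp->query($query, $ctx);
    return ($nodes !== false && $nodes->length > 0)
        ? trim($nodes->item(0)->textContent)
        : ''; // fall back to '' when the node is missing
}

$html = '<div id="next-stories"><div class="flex-media">'
      . '<h3><a href="/a1">First story</a></h3></div></div>';

$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings from messy real-world markup
$xp = new DOMXPath($doc);

foreach ($xp->query('//div[@class="flex-media"]') as $node) {
    $title   = xpathText($xp, './/h3/a', $node);          // present node
    $missing = xpathText($xp, './/p[@class="byline"]', $node); // absent node -> ''
}
```

With Symfony's Crawler the equivalent helper would call `$node->filter($selector)` once, check `count()`, and return `text()` or `''`, removing the duplicated selector strings.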

③ Downloading remote images with Laravel:

// Requires: GuzzleHttp\Client, GuzzleHttp\Exception\RequestException,
// Ramsey\Uuid\Uuid, Illuminate\Support\Facades\Storage
private function downloadImg($url)
{
    try {
        $client = new Client();
        $file = Uuid::uuid1()->getHex(); // random file name
        $extentArray = explode('.', $url);
        $extent = array_pop($extentArray) ?? 'jpg'; // naive: takes everything after the last dot
        $filename = $file . "." . $extent;
        $res = $client->get($url)->getBody()->getContents();
        Storage::put("cover/$filename", $res);
        return 'cover/' . $filename;
    } catch (RequestException $exception) {
        return ""; // download failed; caller gets an empty path
    }
}
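Note that the explode('.') approach breaks on URLs with a query string (photo.jpg?w=300 yields an "extension" of jpg?w=300), and since explode always returns at least one element, the ?? 'jpg' fallback never actually fires. A safer sketch (urlExtension is an illustrative helper name, not part of the original code):

```php
<?php
// Derive a file extension from a URL, ignoring query string and
// fragment; fall back to 'jpg' when the path has no extension.
function urlExtension(string $url, string $fallback = 'jpg'): string
{
    $path = parse_url($url, PHP_URL_PATH) ?: ''; // path only, no ?query or #fragment
    $ext = pathinfo($path, PATHINFO_EXTENSION);
    return $ext ? $ext : $fallback; // handles '' and null alike
}
```

In downloadImg the two explode/array_pop lines would then become `$extent = urlExtension($url);`.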

④ Triggering long-running work through an HTTP(S) request is a poor fit. A crawler job like this is best run as a scheduled task, handled by the PHP CLI rather than php-cgi/php-fpm.

In app/Console/Kernel.php, register the job as a closure or an Artisan command:

protected function schedule(Schedule $schedule)
{
    // Type-hinted closure parameters are resolved from the container.
    $schedule->call(function (CheckCollegeController $check) {
        $check->downloadImg();
    })->daily();
    // Note: appendOutputTo()/sendOutputTo() only work on command() and
    // exec() events, not on call() closures — wrap the job in an
    // Artisan command if you need its output logged to a file such as
    // /home/wwwlogs/schedule.log.
}
Run php artisan schedule:run in the terminal to execute whatever tasks are currently due.
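A single schedule:run invocation only fires the tasks due at that moment; in production the scheduler is normally driven by one cron entry that runs every minute (the project path below is an assumption — adjust it to your deployment):

```shell
# Run the Laravel scheduler every minute; Laravel itself decides
# which registered tasks are actually due on each tick.
* * * * * cd /path-to-your-project && php artisan schedule:run >> /dev/null 2>&1
```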