Laravel Guzzle Crawler 爬取网页内容
①:若为API,返回json数据格式,如下:
$client = new Client(['base_uri' => 'https://www.usnews.com']); for ($i = 1; $i <= $total_pages; $i++) { $result = $client->request('get', "/best-colleges/rankings/$classtype?_page=$i&format=json", [ 'headers' => [ "Proxy-Connection" => "keep-alive", "Pragma" => "no-cache", "Cache-Control" => "no-cache", "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36", "Accept" => "image/x-xbitmap,image/jpeg,application/x-shockwave-flash,application/vnd.ms-excel,application/vnd.ms-powerpoint,application/msword,*/*", "DNT" => "1", "Accept-Encoding" => "gzip, deflate, sdch", "Accept-Language" => "zh-CN,zh;q=0.8,en-US;q=0.7,en;q=0.6", ] ] ); $school = json_decode($result->getBody()); if ($total_pages == 1) { $total_pages = $school->data->total_pages;//获取总页数后赋值 } sleep(3); }
②若为html页面,获取后使用crawler解析dom节点
$client = new Client(['base_uri' => 'https://www.usnews.com']); $result = $client->request('get', "/best-colleges/{$college->url_name}-{$college->primary_key}", [ 'headers' => [ "Proxy-Connection" => "keep-alive", "Pragma" => "no-cache", "Cache-Control" => "no-cache", "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36", "Accept" => "image/x-xbitmap,image/jpeg,application/x-shockwave-flash,application/vnd.ms-excel,application/vnd.ms-powerpoint,application/msword,*/*", "DNT" => "1", "Accept-Encoding" => "gzip, deflate, sdch", "Accept-Language" => "zh-CN,zh;q=0.8,en-US;q=0.7,en;q=0.6", ] ] ); $content = $result->getBody()->getContents(); $res = $this->usnewsContentArticle($content); private function usnewsContentArticle($content)//获取节点,先使用审查元素确定元素再使用右键复制css 选择器挺好用。 { $crawler = new Crawler($content); $articles = $crawler->filter('#next-stories > div.flex-media')->each(function (Crawler $node, $i) { $article['title'] = $node->filter('div:nth-child(2)>h3:nth-child(1) > a:nth-child(1)')->count() ? $node->filter('div:nth-child(2)>h3:nth-child(1) > a:nth-child(1)')->text() : ''; $article['href'] = $node->filter('div:nth-child(2) > h3:nth-child(1) > a:nth-child(1)')->count() ? $node->filter('div:nth-child(2) > h3:nth-child(1) > a:nth-child(1)')->attr('href') : ''; $article['auto'] = $node->filter('div:nth-child(2) > div:nth-child(2) > p:nth-child(1) > span:nth-child(1) > a:nth-child(1)')->count() ? $node->filter('div:nth-child(2) > div:nth-child(2) > p:nth-child(1) > span:nth-child(1) > a:nth-child(1)')->text() : ''; $article['time'] = $node->filter('div:nth-child(2) > div:nth-child(2) > p:nth-child(1)')->count() ? $node->filter('div:nth-child(2) > div:nth-child(2) > p:nth-child(1)')->text() : ''; $article['desc'] = $node->filter('div:nth-child(2) > div:nth-child(2) > p:nth-child(2)')->count() ? $node->filter('div:nth-child(2) > div:nth-child(2) > p:nth-child(2)')->text() : ''; $article['img'] = $node->filter('div:nth-child(1) > a > picture > source:nth-child(1)')->count() ? $node->filter('div:nth-child(1) > a > picture > source:nth-child(1)')->attr('srcset') : ''; $article['content'] = $this->usnewsArticleContent($article['href']); sleep(3); return $article; }); return $articles; }
③若使用laravel下载网络图片
private function downloadImg($url) { try { $client = new Client(); $file = Uuid::uuid1()->getHex(); $extentArray = explode('.', $url); $extent = array_pop($extentArray) ?? 'jpg'; $filename = $file . "." . $extent; $res = $client->get($url)->getBody()->getContents(); Storage::put("cover/$filename", $res); return 'cover/' . $filename; } catch (RequestException $exception) { return ""; } }
④耗时任务使用http或https连接不太好,这种爬虫任务最好使用定时任务,有php-cgi处理
在app/Console/Commands/Kernel.php 中,使用闭包或command protected function schedule(Schedule $schedule) { $schedule->call(function (CheckCollegeController $check){ $check->downloadImg()>appendOutputTo('/home/wwwlogs/schedule.log'); }); } 在terminal运行 php artisan schedule:run