Scraping a site... (Snoopy.class and the Client URL Library)

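Every so often you need to scrape the contents of another site's bulletin board.

In PHP you can do this with Snoopy.class or with the cURL library.

Before you can scrape anything, you first have to download the library.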

1. Using Snoopy.class

It can be downloaded from http://sourceforge.net/projects/snoopy/.

After that, put the source in a suitable folder. Then create a page and code it as below, and you can scrape the content. (Example: naver)

<?php
include "./includes/Snoopy.class.php";

$today = date("Y-m-d");

$snpy = new Snoopy;

$snpy->agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)";
$snpy->referer = "http://www.happyourlife.com";

// set some cookies:
$snpy->cookies["SessionID"] = '238472834723489';
$snpy->cookies["favoriteColor"] = "blue";

// set a raw header:
$snpy->rawheaders["Pragma"] = "no-cache";

// set some internal variables:
$snpy->maxredirs = 2;
$snpy->offsiteok = false;
$snpy->expandlinks = false;

// set username and password (optional)
//$snpy->user = "joe";
//$snpy->pass = "bloe";

// fetch the text of the website www.naver.com:
if ($snpy->fetchtext("http://www.naver.com")) {
    // other methods: fetch, fetchform, fetchlinks, submittext and submitlinks

    // response code:
    print "response code: ".$snpy->response_code."<br/>\n";

    // print the headers (each() was removed in PHP 8, so use foreach):
    print "<b>Headers:</b><br/>";
    foreach ($snpy->headers as $key => $val) {
        print $key.": ".$val."<br/>\n";
    }
    print "<br/>\n";

    // print the text of the website:
    //print "<pre>".htmlspecialchars($snpy->results)."</pre>\n";
    print "<pre>".$snpy->results."</pre>\n";
} else {
    print "Snoopy: error while fetching document: ".$snpy->error."\n";
}

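Result

As the source shows, the results value is simply printed with print.

From this point on, to pull out the values you actually want, you have to transform the output yourself: with a regular expression, with string replacement, or by cutting the text on some other delimiter value.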
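The three approaches named above (regular expression, replacement, cutting on a delimiter) could look roughly like the sketch below. The HTML snippet, the `post` class and the function names `extractPosts` and `postIdFromUrl` are made up for illustration; a real board will have different markup and the pattern has to be adapted to it.

```php
<?php
// Hypothetical HTML standing in for a fetched result; real markup will differ.
$html = '<ul>'
      . '<li class="post"><a href="/board/1">First post</a></li>'
      . '<li class="post"><a href="/board/2">Second &amp; third</a></li>'
      . '</ul>';

// 1) Regular expression: capture the href and the link text of every post item.
function extractPosts(string $html): array
{
    $posts = [];
    $pattern = '#<li class="post"><a href="([^"]+)">(.*?)</a></li>#s';
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            // 2) Replacement: strip nested tags, then decode HTML entities.
            $title = html_entity_decode(strip_tags($m[2]), ENT_QUOTES, 'UTF-8');
            $posts[] = ['url' => $m[1], 'title' => trim($title)];
        }
    }
    return $posts;
}

// 3) Cutting on a delimiter: take the last path segment of the URL as the post id.
function postIdFromUrl(string $url): string
{
    $parts = explode('/', rtrim($url, '/'));
    return end($parts);
}

$posts = extractPosts($html);
foreach ($posts as $post) {
    echo postIdFromUrl($post['url']), ': ', $post['title'], "\n";
}
```

A regular expression like this breaks as soon as the site changes its markup, so it is worth keeping the pattern in one place where it is easy to update.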

For function names and further details, see the Snoopy.class documentation at the link below.

http://www.tig12.net/downloads/apidocs/wp/wp-includes/Snoopy.class.html#det_fields_referer

2. Using the Client URL Library

The idea is the same here: a PHP library does the scraping.

The basic source looks like this.

<?php
$url = 'https://www.naver.com';
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
</head>
<body>
<?php
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); // certificate host check; can be turned off if the server is trustworthy
curl_setopt($curl, CURLOPT_USERPWD, "id:pw"); // placeholder; put your own id:pw here
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); // return the response as a string instead of printing it
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
$result = curl_exec($curl);
echo $result;
curl_close($curl);
?>
</body>
</html>

Result

Here too, you have to transform the result value to get what you want out of it (the hard part is deciding how to cut it up so that you end up with the values you need).
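One way to cut up the result without hand-written string slicing is to parse it as HTML and query it. This is a different technique from what the post itself uses, shown only as a sketch: it relies on PHP's DOM extension (`DOMDocument`, `DOMXPath`), and the markup, the `board` id, the `title` class and the function name `extractLinks` are assumptions made for the example.

```php
<?php
// Hypothetical markup standing in for $result; a real page will differ.
$result = '<html><body><div id="board">'
        . '<a class="title" href="/board/10">Notice</a>'
        . '<a class="title" href="/board/11">Hello</a>'
        . '</div></body></html>';

// Returns an array of href => link text for the title links inside the board.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query('//div[@id="board"]//a[@class="title"]') as $a) {
        $links[$a->getAttribute('href')] = trim($a->textContent);
    }
    return $links;
}

print_r(extractLinks($result));
```

The sample is plain ASCII. For pages in other encodings, the charset usually has to be declared in the HTML before parsing, otherwise the text may come out garbled.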

Then, if you register the script with cron and run it periodically, it will pick up new posts on its own.
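For a periodic run to pick up only new posts, the script has to remember what it has already seen. The sketch below keeps the seen post ids in a JSON file. The file name, the `id`/`title` array shape and the helper names `filterNewPosts`, `loadSeenIds` and `saveSeenIds` are all hypothetical, and the post list is hard-coded where a real script would use the scraped result.

```php
<?php
// Returns the posts whose id is not yet in $seenIds.
function filterNewPosts(array $posts, array $seenIds): array
{
    $seen = array_flip($seenIds);
    return array_values(array_filter(
        $posts,
        fn(array $post) => !isset($seen[$post['id']])
    ));
}

// Loads the ids remembered by earlier runs (empty on the first run).
function loadSeenIds(string $file): array
{
    if (!is_file($file)) {
        return [];
    }
    $ids = json_decode(file_get_contents($file), true);
    return is_array($ids) ? $ids : [];
}

// Stores the ids, without duplicates, for the next run.
function saveSeenIds(string $file, array $ids): void
{
    file_put_contents($file, json_encode(array_values(array_unique($ids))), LOCK_EX);
}

// Example run with made-up data.
$stateFile = sys_get_temp_dir() . '/seen_posts_example.json';
$posts = [
    ['id' => '10', 'title' => 'Notice'],
    ['id' => '11', 'title' => 'Hello'],
];

saveSeenIds($stateFile, ['10']); // pretend an earlier run already saw post 10
$new = filterNewPosts($posts, loadSeenIds($stateFile));
saveSeenIds($stateFile, array_merge(loadSeenIds($stateFile), array_column($new, 'id')));

foreach ($new as $post) {
    echo "new: {$post['title']}\n";
}
```

A crontab entry such as `*/10 * * * * /usr/bin/php /path/to/scrape.php` would then run the script every ten minutes; both paths here are placeholders.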

If the site requires HTTP Basic Authorization, pass the id/pw values along with the request and it will log in automatically and then scrape.
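With Basic authentication, "passing the id/pw" comes down to one request header: the string `id:pw`, base64-encoded. The small sketch below builds that header by hand to show what goes over the wire. The function name `basicAuthHeader` and the `user`/`pass` values are made up for the example.

```php
<?php
// Builds the header that HTTP Basic authentication sends: base64 of "id:pw".
function basicAuthHeader(string $id, string $pw): string
{
    return 'Authorization: Basic ' . base64_encode($id . ':' . $pw);
}

echo basicAuthHeader('user', 'pass'), "\n";
// prints "Authorization: Basic dXNlcjpwYXNz"
```

Base64 is an encoding, not encryption, so anyone who can read the request can recover the id and password. That is a good reason to send such requests over https only.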
