crawling and scraping

ro_ot | 2020. 2. 20. 16:39

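* Crawling

Crawling is a technique in which a program periodically visits websites and extracts information from them.

A program that crawls is called a "crawler" or a "spider".

For example, the crawler behind a search engine follows the links on a website to move from page to page, collects the data it finds, and stores it in a database.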
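
The link-following behaviour of a crawler can be sketched as a breadth-first traversal. This is a minimal offline sketch under stated assumptions: the `PAGES` dictionary stands in for a real website (its URLs and contents are made up), and `LinkParser` and `crawl` are hypothetical helper names, not part of any library.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "website": URL -> HTML (stands in for real HTTP requests)
PAGES = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "/b": '<a href="/">home</a>',
    "/c": "no links here",
}


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.links.append(value)


def crawl(start):
    """Visit pages breadth-first, following links and skipping pages already seen."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)  # a real crawler would store the page data here
        parser = LinkParser()
        parser.feed(PAGES.get(url, ""))
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("/")
print(visited)
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other.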

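* Scraping
Scraping is a technique for extracting specific information from a website.

With scraping, information on a website can be collected easily.

Most information published on the web is in HTML format. Storing it in a database requires processing the data first. To strip out unneeded content such as advertisements and keep only what you need, you have to analyze the structure of the site.

Scraping therefore covers not only extracting data from the web but also analyzing that structure.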
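
As a rough illustration of removing unneeded parts and keeping only the wanted data, the sketch below uses BeautifulSoup on a made-up page. The class names `ad` and `article` are assumptions for this example; on a real site you would find the right names by inspecting its HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the class names "ad" and "article" are assumptions for this sketch
html = """
<html><body>
  <div class="ad">Buy now!</div>
  <div class="article"><h1>Title</h1><p>Body text</p></div>
  <div class="ad">Another banner</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Remove every element identified as an advertisement
for ad in soup.find_all("div", {"class": "ad"}):
    ad.decompose()

# Keep only the part of the page we care about
article = soup.find("div", {"class": "article"})
record = {
    "title": article.find("h1").text,
    "body": article.find("p").text,
}
print(record)
```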

In addition, many sites now require you to log in before you can reach the useful information.
In that case, knowing the URL alone is not enough.
To scrape properly, you also need to know how to log in and then access the pages you need.
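
One possible way to handle a login with the standard library is an opener that keeps cookies between requests. The sketch below only builds the request and does not contact any server; `LOGIN_URL`, the form field names, and the members page are hypothetical and would differ for every site.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical values: replace with the real login URL and form field names
LOGIN_URL = "https://example.com/login"
FORM = {"username": "alice", "password": "secret"}

# An opener that remembers cookies, so the session survives across requests
cookies = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookies))

# Passing data= turns the request into a POST with a URL-encoded form body
body = urllib.parse.urlencode(FORM).encode("utf-8")
login_request = urllib.request.Request(LOGIN_URL, data=body)

print(login_request.get_method())
print(body)

# Against a real site, the next steps would be (not executed here):
# opener.open(login_request)                                  # log in; cookies are stored
# html = opener.open("https://example.com/members").read()    # reuse the session
```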

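* How to extract information from the web
1. Python's standard library provides the "urllib" package for retrieving data from websites.

With this library you can download data over HTTP or FTP.

2. urllib is a package that collects several modules for working with URLs.

3. Among them, the urllib.request module provides the functions for accessing data on a website.

It also supports a range of request handling such as authentication, redirects, and cookies.

4. With the BeautifulSoup library, you can easily extract information from HTML and XML.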

* urllib.request

Downloading with urllib.request: the urlretrieve() function

To download a file, use the urlretrieve() function in the urllib.request module.

This function downloads a URL directly to a file.

# download_png1
import urllib.request as request
# Specify the URLs to download
# url = "http://uta.pw/shodou/img/28/214.png"
url_1 = "https://t1.daumcdn.net/daumtop_chanel/op/20170315064553027.png"
url_2 = "https://www.naver.com/"
url_3 = "http://www.kma.go.kr/weather/forecast/mid-term-rss3.jsp"

# Download each URL and save it under the given file name
request.urlretrieve(url_1, 'daum.png')
request.urlretrieve(url_2, 'naver.html')
request.urlretrieve(url_3, 'kma.html')
print("Saved!")
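
To try urlretrieve() without a network connection, the sketch below points it at a local file through a file:// URL. The temporary directory and file names are made up for the example; the call itself is the same one used for http:// URLs, and it returns the saved path together with the response headers.

```python
import os
import tempfile
import urllib.request
from pathlib import Path

# Create a small local file to act as the "remote" resource
workdir = tempfile.mkdtemp()
source = Path(workdir) / "source.txt"
source.write_bytes(b"hello urlretrieve")

# file:// URLs go through the same urlretrieve() call as http:// URLs
target = os.path.join(workdir, "copy.txt")
saved_path, headers = urllib.request.urlretrieve(source.as_uri(), target)

print(saved_path)                     # path of the saved file
print(headers.get("Content-Length"))  # size reported for the resource
```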

Downloading with urllib.request: the urlopen() function

This time, use the urlopen() function in the urllib.request module to load the data into memory first, and then save it to a file.

# download_png2

import urllib.request as req

# Specify the URL and the file name to save to
url = "http://uta.pw/shodou/img/28/214.png"
savename = "test2.png"

# Download into memory
mem = req.urlopen(url).read()

# Save to a file ( w: write mode, b: binary mode )
with open(savename, mode="wb") as f:
    f.write(mem)
print("Saved!")
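
When the downloaded data is text rather than an image, the bytes returned by read() still have to be decoded. The following is a minimal sketch of one way to do that: `fetch_text` is a hypothetical helper, and the data: URL is used only so that the example runs offline.

```python
import urllib.request


def fetch_text(url, default_encoding="utf-8", timeout=10):
    """Download a URL into memory and decode it to str (hypothetical helper)."""
    # The response object is a context manager, so it is closed for us
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()                             # bytes held in memory
        charset = response.headers.get_content_charset()  # None if not declared
    return raw.decode(charset or default_encoding)


# A data: URL keeps the example offline; an http(s):// URL is used the same way
text = fetch_text("data:text/plain;charset=utf-8,hello%20world")
print(text)
```

Reading the charset from the response headers avoids hard-coding an encoding that may not match what the server actually sends.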

* The RSS service of the Korea Meteorological Administration (KMA)

# download_forecast

import urllib.request
import urllib.parse

API = "http://www.kma.go.kr/weather/forecast/mid-term-rss3.jsp?stnId=108"
# Alternative: URL-encode the parameters instead of hard-coding the query string
# (API would then be the URL without "?stnId=108")
# values = {
#     'stnId': '108'  # nationwide
# }
# params = urllib.parse.urlencode(values)
# Build the request URL
# url = API + "?" + params
# print("url=", url)
# Fetch the XML and print it
data = urllib.request.urlopen(API).read()
text = data.decode("utf-8")
print(text)
# Save it as forecast.xml ( w: write mode )
with open("forecast.xml", mode="w", encoding="utf-8") as f:
    f.write(text)
print("Saved!")
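
The parameter-encoding approach that is commented out in the listing can be tried on its own. In this sketch `build_url` is a hypothetical helper and the second query is an invented example; only `urllib.parse.urlencode` comes from the standard library.

```python
import urllib.parse

BASE = "http://www.kma.go.kr/weather/forecast/mid-term-rss3.jsp"


def build_url(base, params):
    """Append URL-encoded query parameters to a base URL (hypothetical helper)."""
    return base + "?" + urllib.parse.urlencode(params)


url = build_url(BASE, {"stnId": "108"})
print(url)

# urlencode also escapes characters that are not safe in a URL
query = urllib.parse.urlencode({"q": "weather forecast", "lang": "ko"})
print(query)
```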

* Reading the XML file

# xml_forecast

from bs4 import BeautifulSoup
import urllib.request as req
import os.path

url = "http://www.kma.go.kr/weather/forecast/mid-term-rss3.jsp?stnId=108"
savename = "forecast.xml"
# Download the feed only if it has not been saved yet
if not os.path.exists(savename):
    req.urlretrieve(url, savename)

# Parse with BeautifulSoup
with open(savename, "r", encoding="utf-8") as f:
    xml = f.read()
soup = BeautifulSoup(xml, 'html.parser')

# Collect the forecast for each region
info = {}
for location in soup.find_all("location"):
    name = location.find('city').string  # city name
    wf = location.find('wf').string      # weather
    tmx = location.find('tmx').string    # high temperature
    tmn = location.find('tmn').string    # low temperature
    weather = wf + ':' + tmn + ' ~ ' + tmx
    if name not in info:
        info[name] = []
    info[name].append(weather)  # info = { name: [weather, ...] }

# Print the weather grouped by region
for name in info.keys():
    print("+", name)
    for weather in info[name]:
        print("| - ", weather)
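
The listing reads only the first wf/tmn/tmx values found under each location. If a location holds several forecast entries, a variation like the one below collects all of them. The sample XML is a simplified, made-up stand-in (the `data` wrapper element and all values are assumptions), so adjust the tag names to the actual feed.

```python
from bs4 import BeautifulSoup

# Simplified sample: location > city + several data entries (structure assumed, values made up)
xml = """
<rss>
  <location>
    <city>Seoul</city>
    <data><wf>Sunny</wf><tmn>2</tmn><tmx>11</tmx></data>
    <data><wf>Cloudy</wf><tmn>4</tmn><tmx>12</tmx></data>
  </location>
  <location>
    <city>Busan</city>
    <data><wf>Rain</wf><tmn>7</tmn><tmx>14</tmx></data>
  </location>
</rss>
"""

soup = BeautifulSoup(xml, "html.parser")

info = {}
for location in soup.find_all("location"):
    name = location.find("city").string
    # find_all("data") visits every forecast entry, not just the first one
    info[name] = [
        f"{d.find('wf').string}:{d.find('tmn').string} ~ {d.find('tmx').string}"
        for d in location.find_all("data")
    ]

for name, forecasts in info.items():
    print("+", name)
    for weather in forecasts:
        print("| - ", weather)
```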

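* Scraping with the BeautifulSoup module

1. Scraping means fetching data from a website and extracting the information you want from it.

Because there is now so much data on the internet, making good use of scraping matters.

2. The library you cannot leave out when scraping with Python is "BeautifulSoup".

With it, you can easily extract information from HTML and XML.

3. Many recent scraping libraries handle everything from downloading to HTML parsing, but BeautifulSoup is strictly a library for parsing HTML and XML.

BeautifulSoup itself has no download capability.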

* How to use the BeautifulSoup module
find('tag') : returns only the first matching tag
find('tag').string : returns the text inside the tag (None if the tag has more than one child)
find('tag').text : returns all the text inside the tag
find('tag').get_text() : returns all the text inside the tag (same as .text)

find('tag', {'class': 'class-name'}) : finds a tag by its class value
find('tag', {'id': 'id-name'}) : finds a tag by its id
find(id='id-name') : finds a tag by its id

findAll('tag') : returns all matching tags as a list (older alias of find_all)
find_all('tag') : returns all matching tags as a list
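
The calls listed above can be compared side by side on a small fragment. The HTML here is invented for the illustration; the point is mainly how .string differs from .text once a tag contains nested tags.

```python
from bs4 import BeautifulSoup

html = '<div id="greeting" class="box">Hello <b>world</b></div><div class="box">bye</div>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("div")
print(first.string)      # None: this div has two children (a text node and <b>)
print(first.text)        # Hello world
print(first.get_text())  # Hello world

print(soup.find("b").string)                    # world
print(soup.find("div", {"class": "box"}).text)  # Hello world (first match only)
print(soup.find(id="greeting")["class"])        # ['box']
print(len(soup.find_all("div")))                # 2
```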

* Basic BeautifulSoup examples

# bs_find01

from bs4 import BeautifulSoup

html_str = '<html><div>hello</div></html>'
soup = BeautifulSoup(html_str, "html.parser")

print(soup)                     # <html><div>hello</div></html>
print(soup.find("div"))         # <div>hello</div>
print(soup.find("div").text)    # hello

# bs_find02

from bs4 import BeautifulSoup

html_str = """
<html>
 <body>
 <ul>
 <li>hello</li>
 <li>bye</li>
 <li>welcome</li>
 </ul>
 </body>
</html>
"""
bs_obj = BeautifulSoup(html_str, "html.parser")
print(bs_obj)
print()

ul = bs_obj.find("ul")  # get the first ul tag, including everything inside it
print(ul)
print()

li = ul.find('li')      # get the first li tag
print(li)               # <li>hello</li>
print(li.text)          # hello

# bs_findall01

from bs4 import BeautifulSoup

html_str = """
<html>
    <body>
        <ul>
            <li>hello</li>
            <li>bye</li>
            <li>welcome</li>
        </ul>
    </body>
</html>
"""

bs_obj = BeautifulSoup(html_str, "html.parser")

ul = bs_obj.find("ul")          # get the first ul tag
li_list = ul.findAll("li")      # get every li tag
print(type(li_list))            # <class 'bs4.element.ResultSet'>
print(li_list)                  # [<li>hello</li>, <li>bye</li>, <li>welcome</li>]
for i in li_list:               # .text prints only the value inside each <li> tag
    print(i.text)               # hello
                                # bye
                                # welcome

* Basic BeautifulSoup examples: accessing data by class attribute

from bs4 import BeautifulSoup

# Give each list a different class value so that the tag you want can be selected.
html_str = """
<html>
    <body>
        <ul class="greet">
            <li>hello</li>
            <li>bye</li>
            <li>welcome</li>
        </ul>
        <ul class="reply">
            <li>ok</li>
            <li>no</li>
            <li>sure</li>
        </ul>
    </body>
</html>
"""
bs_obj = BeautifulSoup(html_str, "html.parser")
ul_reply = bs_obj.find("ul", {"class": "reply"})  # get the ul tag whose class is "reply"
li_list = ul_reply.findAll("li")                  # get every li tag inside it
for li in li_list:
    print(li.text)                                # prints ok, no, sure (one per line)
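
The same class lookup can be written in two other ways that BeautifulSoup supports: the `class_` keyword argument and a CSS selector passed to `select()`. This is a shortened sketch of the example above, not a replacement for it.

```python
from bs4 import BeautifulSoup

html_str = """
<ul class="greet"><li>hello</li><li>bye</li></ul>
<ul class="reply"><li>ok</li><li>no</li><li>sure</li></ul>
"""
bs_obj = BeautifulSoup(html_str, "html.parser")

# class is a reserved word in Python, so the keyword argument is spelled class_
via_keyword = [li.text for li in bs_obj.find("ul", class_="reply").find_all("li")]

# The same lookup written as a CSS selector
via_selector = [li.text for li in bs_obj.select("ul.reply li")]

print(via_keyword)
print(via_selector)
```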

* Extracting attribute values

# bs_href

from bs4 import BeautifulSoup
html_str = """
<html>
     <body>
         <ul class="ko">
             <li>
                 <a href="https://www.naver.com/">네이버</a>
             </li>
             <li>
                 <a href="https://www.daum.net/">다음</a>
             </li>
         </ul>
         <ul class="sns">
             <li>
                 <a href="https://www.google.com/">구글</a>
             </li>
             <li>
                 <a href="https://www.facebook.com/">페이스북</a>
             </li>
         </ul>
     </body>
</html>
"""

bs_obj = BeautifulSoup(html_str, "html.parser")
atag = bs_obj.find("a")         # get the first anchor tag
print(atag)                     # <a href="https://www.naver.com/">네이버</a>
print(atag['href'])             # https://www.naver.com/
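
To go from the first link to every link on a page, find_all() can be combined with attribute access. In the sketch below `BASE_URL`, the relative link, and the anchor without an href are assumptions added to show two common edge cases.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# BASE_URL and the sample links are assumptions added for this sketch
BASE_URL = "https://www.naver.com/"
html_str = """
<ul>
  <li><a href="https://www.daum.net/">Daum</a></li>
  <li><a href="/about">About</a></li>
  <li><a>no href here</a></li>
</ul>
"""

bs_obj = BeautifulSoup(html_str, "html.parser")

links = []
for atag in bs_obj.find_all("a"):
    href = atag.get("href")  # .get() returns None instead of raising KeyError
    if href is None:
        continue
    links.append((atag.text, urljoin(BASE_URL, href)))  # make relative links absolute

print(links)
```

Using `atag['href']` directly would raise a KeyError on the third anchor, which is why `.get()` is used here.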

* Basic BeautifulSoup usage: finding elements by id with find()

Besides walking down from the root one element at a time, BeautifulSoup provides the find() method, which can locate an element by its id attribute.

# bs_test3
from bs4 import BeautifulSoup
html = """
<html>
    <head>
        <title>작품과 작가 모음</title>
    </head>
    <body>
        <h1>책 정보</h1>
        <p id="book_title">토지</p>
        <p id="author">박경리</p>
        
        <p id="book_title">태백산맥</p>
        <p id="author">조정래</p>
        
        <p id="book_title">감옥으로부터의 사색</p>
        <p id="author">신영복</p>
    </body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')

# Get the first p tag whose id is book_title
# (ids should be unique in valid HTML; this sample repeats them only to demonstrate find_all)
title = soup.find('p', {"id":"book_title"}).text
print('title:', title) # title: 토지

# Get the first p tag whose id is author
author = soup.find('p', {"id":"author"}).text
print('author:', author) # author: 박경리

# Get every p tag whose id is book_title, returned as a list
title2 = soup.find_all('p', {"id":"book_title"})
print('title2:', title2)

for t2 in title2:
    print(t2.text)

# Get every p tag whose id is author, returned as a list
author2 = soup.find_all('p', {"id":"author"})
print('author2:', author2)

for a2 in author2:
    print(a2.text)
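
Since the titles and authors come back as two separate lists, a natural next step is to pair them up. This sketch uses a shortened copy of the sample HTML and assumes the two lists line up one to one, which holds for this sample but is not guaranteed on real pages.

```python
from bs4 import BeautifulSoup

html = """
<p id="book_title">토지</p><p id="author">박경리</p>
<p id="book_title">태백산맥</p><p id="author">조정래</p>
"""
soup = BeautifulSoup(html, "html.parser")

titles = [p.text for p in soup.find_all("p", {"id": "book_title"})]
authors = [p.text for p in soup.find_all("p", {"id": "author"})]

# zip() pairs the n-th title with the n-th author (assumes both lists line up)
books = list(zip(titles, authors))
for title, author in books:
    print(f"{title} / {author}")
```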
