Web Crawling and Web Scraping

2020. 5. 16. 09:45

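Web crawling and web scraping are techniques for reading information from the web and processing it into the shape you want.

The two terms are used in similar but slightly different ways; since they mean much the same thing in practice, this tutorial uses "scraping" for both.

Web scraping requires an understanding of the basic structure of HTML.

Web pages are written in HTML (HyperText Markup Language), which carries the information a page presents.

CSS styles the HTML, and JavaScript controls it dynamically.

So to read the contents of a web page and process its information, you need at least a working knowledge of HTML.

HTML itself is covered by many tutorials available online; this article covers only scraping with Python.

Scraping with Python requires installing two packages.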

1. requests

2. beautifulsoup4

requests reads the HTML of a web page given its address (URL).

beautifulsoup4 parses the HTML and gives it structure by building a tree out of the HTML tags.

pip install requests beautifulsoup4

soupsieve : https://pypi.org/project/soupsieve/

requests : https://pypi.org/project/requests/

chardet : https://pypi.org/project/chardet/

certifi : https://pypi.org/project/certifi/

urllib3 : https://pypi.org/project/urllib3/

idna : https://pypi.org/project/idna/
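As a rough illustration of what "building a tree out of the HTML tags" means, the sketch below parses a small hand-written HTML string (a hypothetical document, not the sample page) and walks from one tag to its parent and siblings. It assumes beautifulsoup4 is installed as shown above.

```python
from bs4 import BeautifulSoup

# A tiny hypothetical document, used here instead of a live page.
html = """
<html>
  <head><title>Demo</title></head>
  <body>
    <div id="wrapper">
      <h2>Heading</h2>
      <p>First paragraph</p>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Every tag becomes a node: .name is the tag name, .parent is the enclosing tag.
h2 = soup.find("h2")
print(h2.name)              # h2
print(h2.parent.name)       # div
print(h2.parent.get("id"))  # wrapper

# Direct child tags of <div id="wrapper"> (True matches any tag).
child_names = [child.name for child in h2.parent.find_all(True, recursive=False)]
print(child_names)          # ['h2', 'p']
```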

1. Fetching the entire web document

Use requests to read the web page from the address you want, then use beautifulsoup4 to convert it into a BeautifulSoup object.

The examples in this article use the sample page at the address below.

https://www.jbmpa.com/test/scraping.html

import requests

from bs4 import BeautifulSoup

result = requests.get("https://www.jbmpa.com/test/scraping.html")

html = result.text

BS = BeautifulSoup(html, "html.parser")

print(BS)

Result

<html>
<head>
<title>Python Scraping Test</title>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="JBMPA" name="application-name"/>
<meta content="The Python Programming Language " name="msapplication-tooltip"/>
<meta content="jbmpa.com" name="apple-mobile-web-app-title"/>
<meta content="yes" name="apple-mobile-web-app-capable"/>
<meta content="black" name="apple-mobile-web-app-status-bar-style"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="True" name="HandheldFriendly"/>
<meta content="telephone=no" name="format-detection"/>
<meta content="on" http-equiv="cleartype"/>
<meta content="false" http-equiv="imagetoolbar"/>
<style>
       img{
           width:16px;
       }
       table{
           width:100%;
       }
       th {
           background-color:#ECECEC;
       }
       td{
           margin:10px;
           padding:10px;
       }
       h2 {
           color:#0079BF;
       }
       #wrapper{
           margin:auto;
           width:900px;
       }
       .excitingNote{
           font-style:italic;
           font-weight:bold;
       }
       table {
       border-collapse: collapse;
       border-spacing: 0;
       }
       .table-striped > tbody > tr:nth-child(2n+1) > td,
       .table-striped > tbody > tr:nth-child(2n+1) > th {
       background-color: #f9f9f9;
       text-align: left;
       }
       .table > tbody > tr:hover > td, .table > tbody > tr:hover > th {
       background-color: #f5f5f5;
       }
       .table thead > tr > th, .table tbody > tr > th, .table tfoot > tr > th, .table thead > tr > td, .table tbody > tr > td, .table tfoot > tr > td {
       border-bottom: 1px solid #ddd;
       line-height: 1.42857;
       padding: 8px;
       vertical-align: top;
       }
       .table thead > tr > th {
       border-bottom: 2px solid #ddd;
       vertical-align: bottom;
       }
   </style>
</head>
<body>
<div id="wrapper">
<h2><b>TIOBE Index for January 2020</b></h2>
<h3>January Headline: Programming Language C awarded Programming Language of the Year 2019</h3>
<p id="title1">
Everybody thought that Python would become TIOBE's programming language of the year for the second consecutive time.
But it is good old language C that wins the award this time with an yearly increase of 2.4%. Runners up are C# (+2.1%), Python (+1.4%) and Swift (+0.6%).
Why is the programming language C still hot? The major drivers behind this trend are the Internet of Things (IoT) and
the vast amount of small intelligent devices that are released nowadays. C excels when it is applied to small devices that are performance-critical.
It is easy to learn and there is a C compiler available for every processor.
Congratulations to C! Other interesting winners of 2019 are Swift (from #15 to #9) and Ruby (from #18 to #11).
Swift is a permanent top 10 player now and Ruby seems to become one soon.
Some languages that were supposed to break through in 2019 didn't: Rust won only 3 positions (from #33 to #30), Kotlin lost 3 positions (from #31 to #35),
Julia lost even 10 positions (from #37 to #47) and TypeScript won just one position (from #49 to #48). Let's see what 2020 has in store for us!
</p>
<p id="title2">The TIOBE Programming Community index is an indicator of the popularity of programming
languages. The index is updated once a month. The ratings are based on the number of
skilled engineers world-wide, courses and third party vendors. Popular search engines such as
Google, Bing, Yahoo!, Wikipedia, Amazon, YouTube and Baidu are used to calculate the ratings.
It is important to note that the TIOBE index is not about the <i>best</i> programming language or the language
in which <i>most lines of code</i> have been written.</p>
<p id="title3">The index can be used to check whether your programming skills are still up to date or to make a
strategic decision about what programming language should be adopted when starting to build a new
software system. The definition of the TIOBE index can be found here.
</p>
<table class="table table-striped" id="top20">
<thead>
<tr>
<th>Jan 2020</th>
<th>Jan 2019</th>
<th>Change</th>
<th>Programming Language</th>
<th>Ratings</th>
<th>Change</th>
</tr>
</thead>
<tbody>
<tr><td>1</td><td>1</td><td></td><td>Java</td><td>16.896%</td><td>-0.01%</td></tr>
<tr><td>2</td><td>2</td><td></td><td>C</td><td>15.773%</td><td>+2.44%</td></tr>
<tr><td>3</td><td>3</td><td></td><td>Python</td><td>9.704%</td><td>+1.41%</td></tr>
<tr><td>4</td><td>4</td><td></td><td>C++</td><td>5.574%</td><td>-2.58%</td></tr>
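<tr><td>5</td><td>7</td><td>......</td><td>C#</td><td>5.349%</td><td>+2.07%</td></tr>
<tr><td>6</td><td>5</td><td>......</td><td>Visual Basic .NET</td><td>5.287%</td><td>-1.17%</td></tr>
<tr><td>7</td><td>6</td><td>......</td><td>JavaScript</td><td>2.451%</td><td>-0.85%</td></tr>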
<tr><td>8</td><td>8</td><td></td><td>PHP</td><td>2.405%</td><td>-0.28%</td></tr>
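<tr><td>9</td><td>15</td><td>......</td><td>Swift</td><td>1.795%</td><td>+0.61%</td></tr>
<tr><td>10</td><td>9</td><td>......</td><td>SQL</td><td>1.504%</td><td>-0.77%</td></tr>
<tr><td>11</td><td>18</td><td>......</td><td>Ruby</td><td>1.063%</td><td>-0.03%</td></tr>
<tr><td>12</td><td>17</td><td>......</td><td>Delphi/Object Pascal</td><td>0.997%</td><td>-0.10%</td></tr>
<tr><td>13</td><td>10</td><td>......</td><td>Objective-C</td><td>0.929%</td><td>-0.85%</td></tr>
<tr><td>14</td><td>16</td><td>......</td><td>Go</td><td>0.900%</td><td>-0.22%</td></tr>
<tr><td>15</td><td>14</td><td>......</td><td>Assembly language</td><td>0.877%</td><td>-0.32%</td></tr>
<tr><td>16</td><td>20</td><td>......</td><td>Visual Basic</td><td>0.831%</td><td>-0.20%</td></tr>
<tr><td>17</td><td>25</td><td>......</td><td>D</td><td>0.825%</td><td>+0.25%</td></tr>
<tr><td>18</td><td>12</td><td>......</td><td>R</td><td>0.808%</td><td>-0.52%</td></tr>
<tr><td>19</td><td>13</td><td>......</td><td>Perl</td><td>0.746%</td><td>-0.48%</td></tr>
<tr><td>20</td><td>11</td><td>......</td><td>MATLAB</td><td>0.737%</td><td>-0.76%</td></tr>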
</tbody>
</table>
<p id="reference">
Reference : TIOBE Index for January 2020

Reference : Web Scraping Lecture

OVITII : ovitii.com

ALLSURVEY : allsurvey.net

</p>
</div>
<hr/>
</body>
</html>
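The fetch step above assumes the request always succeeds. A slightly more defensive variant might look like the sketch below; the helper name fetch_soup and the 10-second timeout are assumptions made for illustration, not part of the original example.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download url and return it parsed as a BeautifulSoup object."""
    # Without a timeout, requests.get can wait indefinitely on a stalled server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

A caller would typically wrap fetch_soup("https://www.jbmpa.com/test/scraping.html") in a try/except for requests.RequestException, which covers connection errors, timeouts, and the HTTPError raised above.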

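2. Getting the title

Print only the title of the HTML document that was read.

Tags in the tree are reached with dot (.) notation. Because the <title> tag sits under the <head> tag in the BS object, the title is accessed as follows.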

BS.head.title

import requests

from bs4 import BeautifulSoup

result = requests.get("https://www.jbmpa.com/test/scraping.html")

html = result.text

BS = BeautifulSoup(html, "html.parser")

# Print the tag itself

print(BS.head.title)

# Print only the text inside the tag

print(BS.head.title.text)

Result

<title>Python Scraping Test</title>
Python Scraping Test
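Two details of dot access are easy to miss: it returns only the first matching tag, and it returns None when the tag is absent. The sketch below shows both on a small inline HTML string (a stand-in for the sample page, so that it runs without a network connection).

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>Python Scraping Test</title></head>"
    "<body><p>one</p><p>two</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Dot access returns only the first matching descendant tag.
first_p = soup.body.p
print(first_p.text)       # one

# A tag that does not exist yields None, so check before reading .text.
footer = soup.body.footer
footer_text = footer.text if footer is not None else ""
print(repr(footer_text))  # ''

# get_text(strip=True) returns the text with surrounding whitespace removed.
title_text = soup.head.title.get_text(strip=True)
print(title_text)         # Python Scraping Test
```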

3. Getting every occurrence of a tag

When the same tag appears several times, the find_all method returns all of them. For example, to get every <meta> tag, write

BS.head.find_all('meta')

find_all finds every tag with the given name and returns them as a list.

import requests

from bs4 import BeautifulSoup

result = requests.get("https://www.jbmpa.com/test/scraping.html")

html = result.text

BS = BeautifulSoup(html, "html.parser")

print(BS.head.find_all('meta'))

Result

[<meta charset="utf-8"/>, <meta content="IE=edge" http-equiv="X-UA-Compatible"/>, <meta content="JBMPA" name="application-name"/>, <meta content="The Python Programming Language " name="msapplication-tooltip"/>, <meta content="jbmpa.com" name="apple-mobile-web-app-title"/>, <meta content="yes" name="apple-mobile-web-app-capable"/>, <meta content="black" name="apple-mobile-web-app-status-bar-style"/>, <meta content="width=device-width, initial-scale=1.0" name="viewport"/>, <meta content="True" name="HandheldFriendly"/>, <meta content="telephone=no" name="format-detection"/>, <meta content="on" http-equiv="cleartype"/>, <meta content="false" http-equiv="imagetoolbar"/>]
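find_all also accepts filters. The sketch below, run on a shortened copy of the sample page's <head>, uses the limit argument to cap the number of results and an attrs filter to keep only the <meta> tags that have a name attribute.

```python
from bs4 import BeautifulSoup

# A shortened copy of the sample page's <head>.
html = """
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="JBMPA" name="application-name"/>
<meta content="jbmpa.com" name="apple-mobile-web-app-title"/>
</head>
"""
soup = BeautifulSoup(html, "html.parser")

# limit stops the search after the given number of matches.
first_two = soup.find_all("meta", limit=2)
print(len(first_two))   # 2

# An attribute value of True matches any tag that has that attribute at all.
named = soup.find_all("meta", attrs={"name": True})
names = [tag["name"] for tag in named]
print(names)            # ['application-name', 'apple-mobile-web-app-title']

# No match gives an empty list, not an error.
tables = soup.find_all("table")
print(tables)           # []
```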

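4. Getting a specific attribute value

A single tag can carry several attributes.

Use the get method to read an attribute.

The code below reads the content attribute of each <meta> tag.

If the tag does not have the requested attribute, get returns None.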

import requests

from bs4 import BeautifulSoup

result = requests.get("https://www.jbmpa.com/test/scraping.html")

html = result.text

BS = BeautifulSoup(html, "html.parser")

for resultMeta in BS.head.find_all('meta'):
    print(resultMeta.get('content'))

Result

None
IE=edge
JBMPA
The Python Programming Language
jbmpa.com
yes
black
width=device-width, initial-scale=1.0
True
telephone=no
on
false
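The first None in the output comes from <meta charset="utf-8"/>, which has no content attribute. The sketch below contrasts get, which tolerates a missing attribute, with square-bracket access, which raises KeyError. It uses a two-tag excerpt of the sample page's <head>.

```python
from bs4 import BeautifulSoup

html = '<head><meta charset="utf-8"/><meta content="JBMPA" name="application-name"/></head>'
soup = BeautifulSoup(html, "html.parser")
charset_meta, app_meta = soup.find_all("meta")

print(app_meta.get("content"))                      # JBMPA
print(charset_meta.get("content"))                  # None
# get accepts a fallback value, like dict.get.
print(charset_meta.get("content", "(no content)"))  # (no content)

# Square-bracket access raises KeyError when the attribute is missing.
try:
    charset_meta["content"]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)                                # True

# .attrs exposes all attributes of the tag as a dictionary.
print(charset_meta.attrs)                           # {'charset': 'utf-8'}
```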

5. Finding a specific tag

The find method returns the first tag that matches the given conditions.

The code below finds the meta tag inside head whose name is "application-name".

BS.head.find("meta", {"name":"application-name"})

find can also search by id, class, text and so on, as in the following.

BS.find(id="reference")

import requests

from bs4 import BeautifulSoup

result = requests.get("https://www.jbmpa.com/test/scraping.html")

html = result.text

BS = BeautifulSoup(html, "html.parser")

print(BS.head.find("meta", {"name":"application-name"}))

print(BS.find(id="reference"))

Result

<meta content="JBMPA" name="application-name"/>

<p id="reference">
Reference : TIOBE Index for January 2020

Reference : Web Scraping Lecture

OVITII : ovitii.com

ALLSURVEY : allsurvey.net

</p>

To extract the content attribute from a tag found with find, call the get method on the returned tag.

import requests

from bs4 import BeautifulSoup

result = requests.get("https://www.jbmpa.com/test/scraping.html")

html = result.text

BS = BeautifulSoup(html, "html.parser")

print(BS.head.find("meta", {"name":"application-name"}))

print(BS.head.find("meta", {"name":"application-name"}).get('content'))

print(BS.find("p", {"id":"title1"}))
print(BS.find("p", {"id":"title1"}).text)

Result

<meta content="JBMPA" name="application-name"/>

JBMPA

<p id="title1">
Everybody thought that Python would become TIOBE's programming language of the year for the second consecutive time.
......

Julia lost even 10 positions (from #37 to #47) and TypeScript won just one position (from #49 to #48). Let's see what 2020 has in store for us!
</p>

Everybody thought that Python would become TIOBE's programming language of the year for the second consecutive time.
......
Julia lost even 10 positions (from #37 to #47) and TypeScript won just one position (from #49 to #48). Let's see what 2020 has in store for us!

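To continue searching forward from an element you have already found, use the find_next method. Because class is a reserved word in Python, the keyword argument for a CSS class is spelled class_.

Example:

title = BS.find(id="title1")

content = title.find_next(class_="bbok").text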
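The sample page has no element with the class "bbok", so the sketch below uses a small hypothetical HTML fragment to show how find_next behaves: it searches forward in document order and returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the class names exist only for this illustration.
html = """
<div>
  <h3 id="title1">Heading</h3>
  <p class="note">skip me</p>
  <p class="bbok">target text</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
title = soup.find(id="title1")

# First element after title that carries the class "bbok".
target = title.find_next(class_="bbok")
print(target.text)      # target text

# With a tag name, find_next returns the nearest following tag of that name.
nearest_p = title.find_next("p")
print(nearest_p.text)   # skip me

# No match returns None, so calling .text directly would raise AttributeError.
missing = title.find_next(class_="nothing-here")
print(missing)          # None
```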

6. Getting the text and address of links

In HTML, links are created with the <a> tag, and its "href" attribute holds the target address.

The code below finds every link and extracts its text and address.

import requests

from bs4 import BeautifulSoup

result = requests.get("https://www.jbmpa.com/test/scraping.html")

html = result.text

BS = BeautifulSoup(html, "html.parser")

for resultLink in BS.find_all('a'):
    print(resultLink.text.strip())
    print(resultLink.get('href'))

print("\n###########################\n")

for resultLink in BS.find("p", {"id":"reference"}).find_all('a'):
    print(resultLink.text.strip())
    print(resultLink.get('href'))

Result

here
https://www.tiobe.com/tiobe-index/programming-languages-definition/
TIOBE Index for January 2020
https://www.tiobe.com/tiobe-index/
Web Scraping Lecture
https://www.jbmpa.com/python_advanced/1
ovitii.com
https://www.ovitii.com/
allsurvey.net
https://www.allsurvey.net/

###########################

TIOBE Index for January 2020
https://www.tiobe.com/tiobe-index/
Web Scraping Lecture
https://www.jbmpa.com/python_advanced/1
ovitii.com
https://www.ovitii.com/
allsurvey.net
https://www.allsurvey.net/
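Every href on the sample page is an absolute URL, but many pages use relative ones. The sketch below assumes a hypothetical fragment containing a relative link and an <a> tag without href, and resolves each address against the page URL with urljoin from the standard library.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# URL of the page the HTML is assumed to have come from.
BASE_URL = "https://www.jbmpa.com/test/scraping.html"

# Hypothetical fragment: one absolute link, one relative link, one <a> with no href.
html = """
<p id="reference">
<a href="https://www.tiobe.com/tiobe-index/">TIOBE Index for January 2020</a>
<a href="/python_advanced/1">Web Scraping Lecture</a>
<a>no href here</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

links = []
# href=True keeps only the <a> tags that actually have an href attribute.
for a in soup.find_all("a", href=True):
    links.append((a.get_text(strip=True), urljoin(BASE_URL, a["href"])))

for text, url in links:
    print(text, "->", url)
# TIOBE Index for January 2020 -> https://www.tiobe.com/tiobe-index/
# Web Scraping Lecture -> https://www.jbmpa.com/python_advanced/1
```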

7. select

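find_all collects every match into a single list, so its result is usually stored in a variable or consumed in a for loop. select does the same job, but takes a CSS selector rather than a tag name and attribute filters.

Both return a list, so a specific element can be picked directly by index, as the last lines of the example do with select.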

import requests

from bs4 import BeautifulSoup

result = requests.get("https://www.jbmpa.com/test/scraping.html")

html = result.text

BS = BeautifulSoup(html, "html.parser")

for resultMeta in BS.head.find_all('meta'):
    print(resultMeta.get('content'))

print("\n###########################\n")

for resultMeta in BS.head.select('meta'):
    print(resultMeta.get('content'))

se = BS.head.select('meta')[1]

print(se)

print(se.get('content'))

Result

None
IE=edge
JBMPA
The Python Programming Language
jbmpa.com
yes
black
width=device-width, initial-scale=1.0
True
telephone=no
on
false

###########################

None
IE=edge
JBMPA
The Python Programming Language
jbmpa.com
yes
black
width=device-width, initial-scale=1.0
True
telephone=no
on
false

<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
IE=edge

Use find (find_all) or select_one (select), whichever suits the task.

Note that indexing the result of select with an index that does not exist raises "IndexError: list index out of range", so that case needs to be handled.
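The sketch below shows a few CSS selectors with select and select_one, plus one possible way to avoid the IndexError. The HTML is a reduced stand-in for the sample page, and the helper nth is a hypothetical name introduced only for this illustration.

```python
from bs4 import BeautifulSoup

# Reduced stand-in for the sample page.
html = """
<html><head>
<meta charset="utf-8"/>
<meta content="JBMPA" name="application-name"/>
</head><body>
<table id="top20"><tbody>
<tr><td>1</td><td>Java</td></tr>
<tr><td>2</td><td>C</td></tr>
</tbody></table>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# "head > meta[name]": <meta> tags directly under <head> that have a name attribute.
named = soup.select("head > meta[name]")
print(named[0].get("content"))   # JBMPA

# select_one returns the first match: here, the first row of the table with id "top20".
first_row = soup.select_one("#top20 tbody tr")
first_cells = [td.text for td in first_row.select("td")]
print(first_cells)               # ['1', 'Java']

# select_one returns None when nothing matches, instead of raising.
print(soup.select_one("#missing"))  # None


def nth(items, index):
    """Return items[index], or None when the index is out of range (hypothetical helper)."""
    return items[index] if 0 <= index < len(items) else None


print(nth(soup.select("meta"), 5))  # None
```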

8. Exercise

"Extract the popular Programming Language and Ratings values for 2020 from the sample site."

import requests

from bs4 import BeautifulSoup

result = requests.get("https://www.jbmpa.com/test/scraping.html")

html = result.text

BS = BeautifulSoup(html, "html.parser")

# Inspect the page source and take the HTML inside <tbody>.

Tbody = BS.table.tbody

# Find the <tr> tags inside Tbody.
Tr = Tbody.find_all("tr")

# Loop over the rows in Tr and pull out only the <td> cells that hold the wanted values.

for tdresult in Tr:
    pl = tdresult.select('td')[3].text
    ratings = tdresult.select('td')[4].text
    print(pl, ",", ratings)

Result

Java , 16.896%
C , 15.773%
Python , 9.704%
C++ , 5.574%
C# , 5.349%
Visual Basic .NET , 5.287%
JavaScript , 2.451%
PHP , 2.405%
Swift , 1.795%
SQL , 1.504%
Ruby , 1.063%
Delphi/Object Pascal , 0.997%
Objective-C , 0.929%
Go , 0.900%
Assembly language , 0.877%
Visual Basic , 0.831%
D , 0.825%
R , 0.808%
Perl , 0.746%
MATLAB , 0.737%
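A possible next step is to keep the values as data rather than printing them. The sketch below, which runs on a two-row excerpt of the table plus one deliberately malformed row, collects each row into a dictionary, converts the rating to a float, and skips rows with too few cells. The key names language and rating are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Two rows copied from the sample table, plus one deliberately short row.
html = """
<table id="top20"><tbody>
<tr><td>1</td><td>1</td><td></td><td>Java</td><td>16.896%</td><td>-0.01%</td></tr>
<tr><td>2</td><td>2</td><td></td><td>C</td><td>15.773%</td><td>+2.44%</td></tr>
<tr><td>broken row</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.select("#top20 tbody tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    # Columns 3 and 4 hold the language and rating; skip rows that lack them.
    if len(cells) < 5:
        continue
    rows.append({
        "language": cells[3],
        # "16.896%" -> 16.896
        "rating": float(cells[4].rstrip("%")),
    })

print(rows)
# [{'language': 'Java', 'rating': 16.896}, {'language': 'C', 'rating': 15.773}]
```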

Stock sites for scraping practice

* Finviz http://finviz.com/

* Korean market data http://comp.fnguide.com
* Global data https://www.investing.com

Beautiful Soup Documentation : https://www.crummy.com/software/BeautifulSoup/bs4/doc/
