
(Python Web Crawling) Crawling Google Images

Let's start with the full code.

<Full code> Before running the code below, be sure to create an imgs folder in the same location as the script!

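import ssl
# Disable HTTPS certificate verification for urllib downloads (insecure; avoids certificate errors)
ssl._create_default_https_context = ssl._create_unverified_context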

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import time
import urllib.request

searchKey = input('Enter search keyword: ')

driver = webdriver.Chrome()
driver.get("https://www.google.co.kr/imghp?hl=ko&tab=wi&authuser=0&ogbl")
elem = driver.find_element(By.NAME, "q")

elem.send_keys(searchKey)
elem.send_keys(Keys.RETURN)

SCROLL_PAUSE_TIME = 1
# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)
    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        try:
            driver.find_element(By.CSS_SELECTOR, ".mye4qd").click()
        except Exception:
            # No clickable "more results" button is left, so stop scrolling
            break
    last_height = new_height

# Register a browser-like User-Agent once, before any download
opener = urllib.request.build_opener()
opener.addheaders = [
    ('User-Agent',
     'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')
]
urllib.request.install_opener(opener)

images = driver.find_elements(By.CSS_SELECTOR, ".rg_i.Q4LuWd")
count = 1
for image in images:
    try:
        # Click the thumbnail to open the preview panel
        image.click()
        time.sleep(0.5)

        # Read the src of the larger image shown in the preview panel
        imgUrl = driver.find_element(
            By.XPATH,
            '//*[@id="Sva75c"]/div[2]/div[2]/div[2]/div[2]/c-wiz/div/div/div/div[3]/div[1]/a/img[1]'
        ).get_attribute("src")

        urllib.request.urlretrieve(imgUrl, f'./imgs/{searchKey}{count}.jpg')
        count = count + 1
    except Exception as e:
        # Skip images that fail to open or download
        print('e : ', e)

# quit() closes every window and ends the chromedriver process
driver.quit()

The code above uses Python's Selenium library to crawl Google image search results. It takes a search term, opens the Google image search page, scrolls down to load the result thumbnails, and saves the images.

In brief, the code works as follows.

Import the required libraries.
Read a search term from the user.
Launch Chrome through Selenium and open the Google image search page.
Type the search term and press Enter to submit it.
Scroll down so that more image results load.
Save the images.

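In more detail, each step works as follows.

Importing libraries

The ssl module is used here only to skip certificate verification for HTTPS downloads: assigning ssl._create_unverified_context to ssl._create_default_https_context turns verification off for urllib in this process. That avoids certificate errors but also removes protection against tampered connections, so it is best limited to experiments. The remaining imports bring in Selenium, time, and urllib.request.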

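import ssl
# Disable HTTPS certificate verification for urllib downloads (insecure; avoids certificate errors)
ssl._create_default_https_context = ssl._create_unverified_context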

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import time
import urllib.request
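
The assignment above switches certificate verification off for every urllib download. As a hedged alternative, the sketch below keeps verification on by building an opener around a default SSL context. The helper name build_verified_opener and the short User-Agent value are assumptions made for illustration and are not part of the original code.

```python
import ssl
import urllib.request


def build_verified_opener(user_agent="Mozilla/5.0"):
    """Return (opener, context) where the opener keeps HTTPS checks enabled."""
    # create_default_context() loads the system CA certificates and enables
    # both certificate validation and hostname checking.
    context = ssl.create_default_context()
    handler = urllib.request.HTTPSHandler(context=context)
    opener = urllib.request.build_opener(handler)
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, context


opener, context = build_verified_opener()
print(context.verify_mode == ssl.CERT_REQUIRED)  # True
print(context.check_hostname)  # True
```

An opener built this way could be passed to urllib.request.install_opener() in place of the unverified setup, at the cost of seeing certificate errors again if the local CA store is incomplete.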

Getting the search term

The search term is read from the user.

searchKey = input('Enter search keyword: ')
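
Because the search term later becomes part of each saved file name, characters such as / or : in the input can make the save step fail. The following is a minimal sketch of one way to clean the keyword first. The function safe_filename and its fallback value are hypothetical helpers, not part of the original script.

```python
import re


def safe_filename(keyword, fallback="image"):
    """Turn a search keyword into a string that is safe inside a file name."""
    # Replace path separators and characters Windows rejects with underscores.
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "_", keyword.strip())
    # Collapse whitespace runs so the name stays compact.
    cleaned = re.sub(r"\s+", "_", cleaned)
    # Drop leading/trailing dots and underscores; fall back if nothing is left.
    cleaned = cleaned.strip("._")
    return cleaned or fallback


print(safe_filename("cute cat"))       # cute_cat
print(safe_filename("../etc/passwd"))  # etc_passwd
```

Non-ASCII keywords such as Korean text pass through unchanged, since only characters that are problematic in file names are replaced.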

Launching Chrome and opening the Google image search page

Selenium launches the Chrome browser and opens the Google image search page.

driver = webdriver.Chrome()
driver.get("https://www.google.co.kr/imghp?hl=ko&tab=wi&authuser=0&ogbl")

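Entering the search term and submitting the search

The search term is typed into the search box (the input named q) and submitted by sending the Enter key, so no search button is actually clicked.

elem = driver.find_element(By.NAME, "q")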
elem.send_keys(searchKey)
elem.send_keys(Keys.RETURN)

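Scrolling through the images

The loop scrolls to the bottom of the page, waits SCROLL_PAUSE_TIME seconds for new results to load, and compares the page height with the previous value. When the height stops changing, the code tries to click the element with class mye4qd, which it treats as the button that loads more results; if that click fails, the loop ends.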

SCROLL_PAUSE_TIME = 1
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(SCROLL_PAUSE_TIME)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        try:
            driver.find_element(By.CSS_SELECTOR, ".mye4qd").click()
        except Exception:
            # No clickable "more results" button is left, so stop scrolling
            break
    last_height = new_height
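
The loop above has no upper bound, so it keeps running as long as the button click keeps succeeding. The sketch below wraps the same logic in a function with a maximum number of rounds. The name scroll_to_end and the max_rounds parameter are assumptions added for illustration; "css selector" is the string value behind Selenium's By.CSS_SELECTOR, which lets the function work with any driver-like object.

```python
import time


def scroll_to_end(driver, pause=1.0, max_rounds=50, more_selector=".mye4qd"):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scroll rounds performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            try:
                # Same lookup as By.CSS_SELECTOR in the original loop
                driver.find_element("css selector", more_selector).click()
            except Exception:
                return round_no  # no usable "more results" button: stop
        last_height = new_height
    return max_rounds
```

With a real Selenium driver this would be called as scroll_to_end(driver, pause=SCROLL_PAUSE_TIME) right after the search is submitted.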

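Downloading the images

Once scrolling is finished, the images are downloaded. For each thumbnail, the code clicks it in the browser, reads the address of the larger image from the src attribute in the preview panel, and downloads that URL with urllib.request.

Finally, the Chrome driver is shut down.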

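Running this code saves the images found for the search term in the "./imgs" folder, each named with the search term followed by a number and a .jpg extension. The .jpg extension is applied to every file regardless of the actual image format.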
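
As a hedged sketch of the save step, the function below creates the target folder if it is missing and picks the file extension from the response's Content-Type instead of always using .jpg. The name save_image, the EXTENSIONS table, and the 10-second timeout are assumptions for illustration. urllib.request.urlopen also accepts data: URLs, which a src attribute may contain.

```python
import os
import urllib.request

# Minimal Content-Type to extension map; anything else falls back to ".jpg"
EXTENSIONS = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
    "image/webp": ".webp",
}


def save_image(img_url, folder, keyword, index, user_agent="Mozilla/5.0"):
    """Download img_url into folder as '<keyword><index><ext>'; return the path."""
    os.makedirs(folder, exist_ok=True)  # create the folder if it is missing
    request = urllib.request.Request(img_url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        content_type = response.headers.get_content_type()
        data = response.read()
    extension = EXTENSIONS.get(content_type, ".jpg")
    path = os.path.join(folder, f"{keyword}{index}{extension}")
    with open(path, "wb") as f:
        f.write(data)
    return path
```

Inside the download loop, a call such as save_image(imgUrl, "./imgs", searchKey, count) could stand in for the urlretrieve line, and sending the User-Agent per request avoids installing a global opener.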

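Note that the Chrome WebDriver must match the installed Chrome browser version; Selenium 4.6 and later (including the 4.8.0 used here) ships Selenium Manager, which can obtain a matching driver automatically. Selenium has to be installed (for example with pip), while urllib is part of the Python standard library and needs no installation. Also, this code may stop working if the HTML structure or class names of the Google image search page change.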

# Register a browser-like User-Agent once, before any download
opener = urllib.request.build_opener()
opener.addheaders = [
    ('User-Agent',
     'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')
]
urllib.request.install_opener(opener)

images = driver.find_elements(By.CSS_SELECTOR, ".rg_i.Q4LuWd")
count = 1
for image in images:
    try:
        # Click the thumbnail to open the preview panel
        image.click()
        time.sleep(0.5)
        # Note: this XPath differs from the one in the full listing at the top;
        # use whichever matches the current page markup
        imgUrl = driver.find_element(
            By.XPATH,
            '//*[@id="Sva75c"]/div[2]/div/div[2]/div[2]/div[2]/c-wiz/div/div[1]/div[2]/div[2]/div/a/img'
        ).get_attribute("src")
        urllib.request.urlretrieve(imgUrl, f'./imgs/{searchKey}{count}.jpg')
        count = count + 1
    except Exception as e:
        # Skip images that fail to open or download
        print('e : ', e)

# quit() closes every window and ends the chromedriver process
driver.quit()
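
Since the long XPath is the part most likely to break when the page markup changes, one hedged option is to try several locators in order and use the first that matches. In the sketch below, find_first and PREVIEW_LOCATORS are hypothetical names, and the second locator is an illustrative looser fallback rather than a selector taken from this post. The strings "xpath" and "css selector" are the values behind Selenium's By.XPATH and By.CSS_SELECTOR.

```python
def find_first(driver, locators):
    """Return the first element matched by any (by, value) pair, else None."""
    for by, value in locators:
        try:
            return driver.find_element(by, value)
        except Exception:
            continue  # this locator no longer matches; try the next one
    return None


# Candidate locators, most specific first. The second entry is a
# hypothetical fallback and may need adjusting to the live page.
PREVIEW_LOCATORS = [
    ("xpath",
     '//*[@id="Sva75c"]/div[2]/div[2]/div[2]/div[2]/c-wiz/div/div/div/div[3]/div[1]/a/img[1]'),
    ("css selector", "#Sva75c a img"),
]
```

In the download loop, element = find_first(driver, PREVIEW_LOCATORS) followed by a None check could replace the single find_element call, so one outdated locator does not skip every image.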

The packages used are listed below.

Package            Version
------------------ ---------
async-generator    1.10
attrs              22.2.0
beautifulsoup4     4.11.2
certifi            2022.12.7
charset-normalizer 3.0.1
exceptiongroup     1.1.0
h11                0.14.0
idna               3.4
outcome            1.2.0
packaging          23.0
pip                22.1
PySocks            1.7.1
python-dotenv      0.21.1
requests           2.28.2
selenium           4.8.0
setuptools         62.2.0
sniffio            1.3.0
sortedcontainers   2.4.0
soupsieve          2.4
tqdm               4.64.1
trio               0.22.0
trio-websocket     0.9.2
urllib3            1.26.14
webdriver-manager  3.8.5
wheel              0.37.1
wsproto            1.2.0
