Rust: Parsing HTML (crawling)

Computer/Rust 2021. 7. 15. 19:06


scraper

causal-agent/scraper (github.com): HTML parsing and querying with CSS selectors.

In Go lang

It was not as easy as with Python's beautifulsoup4 or selectolax, or Go's soup.

Below is Go code I wrote that fetches a university's notices; I will rewrite it in Rust.

import (
	"fmt"
	"strconv"
	"strings"

	"github.com/anaskhan96/soup"
)

// AjouLink is the address of where notices of ajou university are being posted
const AjouLink = "https://www.ajou.ac.kr/kr/ajou/notice.do"

// Notice ...
type Notice struct {
	ID     int64  `db:"id" json:"id"`
	Title  string `db:"title" json:"title"`
	Date   string `db:"date" json:"date"`
	Link   string `db:"link" json:"link"`
	Writer string `db:"writer" json:"writer"`
}

// Parse is a function that parses a length of notices
func Parse(url string, length int) []Notice { // doesn't support default value for parameters
	ajouHTML := url
	if url == "" { // As default, use main link
		ajouHTML = fmt.Sprintf("%v?mode=list&articleLimit=%v&article.offset=0", AjouLink, length)
	}

	notices := []Notice{}

	resp, err := soup.Get(ajouHTML)
	if err != nil {
		fmt.Println("[Parser] Check your HTML connection.", err)
		return notices
	}
	doc := soup.HTMLParse(resp)

	ids := doc.FindAll("td", "class", "b-num-box")
	if len(ids) == 0 {
		fmt.Println("[Parser] Check your parser.")
		return notices
	}

	titles := doc.FindAll("div", "class", "b-title-box")
	dates := doc.FindAll("span", "class", "b-date")
	//links := doc.FindAll("div", "class", "b-title-box")
	writers := doc.FindAll("span", "class", "b-writer")
	for i := 0; i < len(ids); i++ {
		id, _ := strconv.ParseInt(strings.TrimSpace(ids[i].Text()), 10, 64)
		title := strings.TrimSpace(titles[i].Find("a").Text())
		link := titles[i].Find("a").Attrs()["href"]
		date := strings.TrimSpace(dates[i].Text())
		writer := writers[i].Text()

		duplicate := "[" + writer + "]"
		if strings.Contains(title, duplicate) {
			title = strings.TrimSpace(strings.Replace(title, duplicate, "", 1))
		}

		notice := Notice{ID: id, Title: title, Date: date, Link: AjouLink + link, Writer: writer}
		notices = append(notices, notice)
	}

	return notices
}

Code

First, I use tokio to write async code, and reqwest to make GET/POST requests easily.

use

use arr_macro::arr;
use reqwest::header::USER_AGENT;
use scraper::{Html, Selector};

struct Notice

Each fetched notice will be stored in the struct below.

// Default is required because the parsing code calls Notice::default()
#[derive(Debug, Default)]
struct Notice {
    id: u64,
    title: String,
    date: String,
    link: String,
    writer: String,
}

GET url

This is how to send a GET request with reqwest without SSL certificate verification.

This site returns 404 unless a USER_AGENT is specified, so I send one along with the request.

let client = reqwest::Client::builder()
    .danger_accept_invalid_certs(true)
    .build()?;

// Without the User-Agent header the site responds with 404
let res = client
    .get(ajou)
    .header(USER_AGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36")
    .send()
    .await?;
let body = res.text().await?;
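
The snippet above passes a variable named ajou to client.get, but its definition is not shown. Here is a minimal sketch, assuming it holds the same default list URL that the Go version builds with fmt.Sprintf. The names AJOU_LINK and notice_url are hypothetical and are not part of reqwest.

```rust
// Base address of the notice board, copied from the Go code earlier in the post.
const AJOU_LINK: &str = "https://www.ajou.ac.kr/kr/ajou/notice.do";

/// Builds the list URL for the first `length` notices.
/// (Hypothetical helper; it mirrors the fmt.Sprintf call in the Go version.)
fn notice_url(length: usize) -> String {
    format!(
        "{}?mode=list&articleLimit={}&article.offset=0",
        AJOU_LINK, length
    )
}
```

With this helper, let ajou = notice_url(10); would produce the URL for the 10 notices used in the test below.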

HTML Parse

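I parsed the body data above the same way as in the Go code.

The hard part was that select, unlike findAll, returns an Iterator rather than an array, so I could not access elements by index.

Implementing a trait would probably work, but for now I did it as in the code further below.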

Since this is only for testing with 10 notices, the index runs from 0 to 9.
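
As an alternative to fixed indices, parallel iterators can be zipped together and collected into a Vec. The following is a minimal sketch, not the code of this post: build_notices is a hypothetical helper, and plain &str iterators stand in for the text extracted from the document.select(...) results. The link field is left empty here because it comes from an attribute rather than from text.

```rust
#[derive(Debug, Default, PartialEq)]
struct Notice {
    id: u64,
    title: String,
    date: String,
    link: String,
    writer: String,
}

/// Pairs up parallel columns of scraped text into notices.
/// zip stops at the shortest column, so a missing cell cannot cause a panic.
fn build_notices<'a>(
    ids: impl Iterator<Item = &'a str>,
    titles: impl Iterator<Item = &'a str>,
    dates: impl Iterator<Item = &'a str>,
    writers: impl Iterator<Item = &'a str>,
) -> Vec<Notice> {
    ids.zip(titles)
        .zip(dates)
        .zip(writers)
        .map(|(((id, title), date), writer)| Notice {
            // Like the Go version, fall back to 0 when the cell is not a number.
            id: id.trim().parse().unwrap_or(0),
            title: title.trim().to_string(),
            date: date.trim().to_string(),
            writer: writer.trim().to_string(),
            // Assumption: the link is filled in separately from the href attribute.
            link: String::new(),
        })
        .collect()
}
```

Because the result is a Vec, notices[0] style access works afterwards. When index access to the raw elements is wanted, collecting a select result with collect::<Vec<_>>() is another option.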

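To get TEXT out of <a ~>TEXT</a>,

for now you have to collect it into a vector and index it: element.text().collect::<Vec<_>>()[0].

For HTML with children, such as <a>Text <li>Two</li></a>, getting [Text, Two] back is convenient, but when there is only a single text node this feels rather odd. There is surely another way.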
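
One possible alternative: text() yields string slices, so ordinary iterator methods apply and no intermediate Vec is needed. This is a minimal sketch in which joined_text and first_text are hypothetical helper names, and any iterator of &str stands in for element.text().

```rust
/// Joins every text fragment of an element into one trimmed String.
fn joined_text<'a>(fragments: impl Iterator<Item = &'a str>) -> String {
    fragments.collect::<String>().trim().to_string()
}

/// Returns only the first text fragment (trimmed), or None when there is none,
/// instead of panicking the way indexing with [0] would on an empty element.
fn first_text<'a>(mut fragments: impl Iterator<Item = &'a str>) -> Option<&'a str> {
    fragments.next().map(str::trim)
}
```

For example, joined_text(id_element.text()) could replace the collect-then-index pattern for the id, date, and writer cells.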

// HTML Parse
let document = Html::parse_document(&body);
let a_selector = Selector::parse("a").unwrap();

// Notice has id, title, date, link, writer
let ids = Selector::parse("td.b-num-box").unwrap();
let titles = Selector::parse("div.b-title-box").unwrap(); // includes links
let dates = Selector::parse("span.b-date").unwrap();
let writers = Selector::parse("span.b-writer").unwrap();

let mut notices: [Notice; 10] = arr![Notice::default(); 10];

let mut id_elements = document.select(&ids);
let mut title_elements = document.select(&titles);
let mut date_elements = document.select(&dates);
let mut writer_elements = document.select(&writers);

// struct Notice
for index in 0..10 {
    let id_element = id_elements.next().unwrap();
    let id = id_element.text().collect::<Vec<_>>()[0]
        .trim() // " 12345 "
        .parse::<u64>()
        .unwrap();

    let date_element = date_elements.next().unwrap();
    let date = date_element.text().collect::<Vec<_>>()[0]
        .trim()
        .to_string(); // "2021-07-15"

    let writer_element = writer_elements.next().unwrap();
    let writer = writer_element.text().collect::<Vec<_>>()[0]
        .trim()
        .to_string(); // "가나다라마"

    let title_element = title_elements.next().unwrap();
    let inner_a = title_element.select(&a_selector).next().unwrap();

    let mut title = inner_a.value().attr("title").unwrap().to_string();
    let link = inner_a.value().attr("href").unwrap().to_string();
    // Remove the duplicated writer tag. title: "[writer] blah" -> title: "blah"
    let dup = format!("[{}]", writer);
    if title.contains(&dup) {
        println!("checked: {}", title);
        // Remove the first occurrence wherever it appears, as the Go version does
        title = title.replacen(&dup, "", 1).trim().to_string();
    }
    // println!("id: {}, title: {}, link: {}, date: {}, writer: {}", id, title, link, date, writer);

    notices[index].id = id;
    notices[index].title = title;
    notices[index].link = link;
    notices[index].date = date;
    notices[index].writer = writer;
}

println!("notices: {:?}", notices);
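
The writer-tag cleanup inside the loop can also be pulled out into a small function, which makes it easy to check against Korean titles. This is a minimal sketch; strip_writer_tag is a hypothetical name, and the sample titles in the usage note are made up for illustration.

```rust
/// Removes the first "[writer]" tag from a title, if present.
fn strip_writer_tag(title: &str, writer: &str) -> String {
    let tag = format!("[{}]", writer);
    // replacen matches whole substrings, so it is safe for multi-byte
    // (for example Korean) text and for tags that are not at the start.
    title.replacen(&tag, "", 1).trim().to_string()
}
```

Calling strip_writer_tag("[학사팀] 수강신청 안내", "학사팀") returns "수강신청 안내", and a title without the tag comes back unchanged apart from trimming.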

References

HTML crawling in Go using soup

Alfex4936/KakaoChatBot-Golang (github.com): A Kakao chatbot for Ajou University built with the Go gin web framework.

Full source of the Rust HTML crawler

Alfex4936/Rust-Backend (github.com): Studying Rust with the Rocket web framework.


Attribution, NonCommercial, NoDerivatives
