C: Fetching a URL's HTML (Using Python from C)
Computer / C & C++ · 2021. 3. 8. 20:03
libcurl
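In C, you can make HTTP requests with libcurl.
However, when the goal is to fetch HTML and then hand it to the Modest engine or Google's gumbo parser, I wanted a simpler way to load just the HTML through Python, so I tried the CPython API.
It also seems reasonable to use an HTML parser written in C, such as selectolax, and handle only the difficult parts in Python.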
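As a rough illustration of "handling only the difficult parts in Python", the sketch below pulls the title and the link targets out of an HTML string. It deliberately uses the standard library's `html.parser` as a stand-in so that it runs without extra packages; it is not selectolax, and the names `TitleAndLinks`, `summarize`, and `sample` are made up for this example.

```python
from html.parser import HTMLParser


class TitleAndLinks(HTMLParser):
    """Collects the <title> text and every non-empty href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lowercase.
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(html):
    """Return (title, [links]) for an HTML string."""
    parser = TitleAndLinks()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links


# Hypothetical sample, shaped like the page fetched later in this post.
sample = (
    "<HTML><HEAD><TITLE>libcurl</TITLE></HEAD>"
    '<body><a href="https://curl.se/libcurl/c/">API</a></body></HTML>'
)
print(summarize(sample))
```

A function like `summarize` could sit next to `getHTML` in helper.py, so the C side receives only the small result it needs instead of the whole page.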
Python/C API
We load the Utils class from helper.py and call Utils.getHTML(url).
The code below follows the Python/C API's reference ownership rules.
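The ownership rule that matters most in this program is that a container takes its own reference to whatever is appended to it, so the caller still owns, and must eventually release, the reference it created. That can be observed from Python itself. The snippet below is a small CPython-specific sketch using `sys.getrefcount`; the variable names are made up for illustration, and only the differences between the counts are meaningful.

```python
import sys

item = object()                      # a fresh object; we hold one reference
before = sys.getrefcount(item)

path = []
path.append(item)                    # the list takes its own reference
after_append = sys.getrefcount(item)

path.clear()                         # the list gives that reference back
after_clear = sys.getrefcount(item)

# Report the changes relative to the starting count.
print(after_append - before, after_clear - before)
```

In C this means that a string created with `PyUnicode_FromString` and passed to `PyList_Append` still has to be released with `Py_DECREF` by the code that created it.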
// main.c
#include <Python.h>  // must come before the standard headers
#include <stdio.h>
#include <stdlib.h>

// PyList_Append does not steal the reference, so release the temporary string.
static int append_path(PyObject *path, const char *dir)
{
    PyObject *item = PyUnicode_FromString(dir);
    if (!item)
        return -1;
    int rc = PyList_Append(path, item);
    Py_DECREF(item);
    return rc;
}

int main(void)
{
    // Owned (new) references: all released at exit.
    PyObject *sys = NULL, *path = NULL, *pName = NULL, *pModule = NULL;
    PyObject *object = NULL, *pFunctionResult = NULL, *pHTML = NULL;
    // Borrowed references: must not be decref'd.
    PyObject *pDict = NULL, *pClass = NULL;
    const char *packages = NULL, *html = NULL;
    int status = 1;

    Py_Initialize();

    // Tell Python where the module lives and add your own site-packages path.
    sys = PyImport_ImportModule("sys");
    if (!sys) {
        PyErr_Print();
        printf("Error importing sys module\n");
        goto exit;
    }
    path = PyObject_GetAttrString(sys, "path");
    if (!path) {
        PyErr_Print();
        printf("Error getting sys.path attribute\n");
        goto exit;
    }
    packages = getenv("THRID_PACKAGES");  // may be unset (NULL)
    if (append_path(path, ".") < 0 ||
        (packages && append_path(path, packages) < 0)) {
        PyErr_Print();
        printf("Error extending sys.path\n");
        goto exit;
    }

    // Import helper.py
    pName = PyUnicode_FromString("helper");
    if (!pName) {
        PyErr_Print();
        printf("Error creating the module name\n");
        goto exit;
    }
    pModule = PyImport_Import(pName);
    if (!pModule) {
        PyErr_Print();
        printf("Error importing python script.\n");
        goto exit;
    }

    // pClass is helper.py -> Utils (borrowed from the module dict)
    pDict = PyModule_GetDict(pModule);
    pClass = PyDict_GetItemString(pDict, "Utils");
    if (!pClass || !PyCallable_Check(pClass)) {
        printf("Error finding Utils class\n");
        goto exit;
    }
    object = PyObject_CallObject(pClass, NULL);
    if (!object) {
        PyErr_Print();
        printf("Error calling Utils class\n");
        goto exit;
    }

    // Call the Python method getHTML
    pFunctionResult = PyObject_CallMethod(object, "getHTML", "(s)",
                                          "https://curl.se/libcurl/");
    if (!pFunctionResult) {
        PyErr_Print();
        goto exit;
    }
    if (pFunctionResult == Py_None) {  // getHTML returns None on failure
        printf("getHTML returned None\n");
        goto exit;
    }
    pHTML = PyUnicode_AsUTF8String(pFunctionResult);
    if (!pHTML) {
        PyErr_Print();
        goto exit;
    }
    // html points into pHTML, so use it before pHTML is released.
    html = PyBytes_AsString(pHTML);
    if (!html) {
        PyErr_Print();
        goto exit;
    }
    printf("\n%s\n", html);
    status = 0;

exit:
    Py_XDECREF(pHTML);
    Py_XDECREF(pFunctionResult);
    Py_XDECREF(object);
    Py_XDECREF(pModule);
    Py_XDECREF(pName);
    Py_XDECREF(path);
    Py_XDECREF(sys);
    if (Py_FinalizeEx() < 0)
        status = 1;
    return status;
}
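For readers who find the C call sequence hard to follow, here is roughly what it does, written as plain Python with the corresponding C API call noted on each line. To keep the sketch runnable without network access, it registers a hypothetical stand-in module named `helper_stub` instead of importing the real helper.py; `call_get_html` and the stub's `Utils` are names invented for this example.

```python
import importlib
import sys
import types

# Hypothetical stand-in for helper.py so the sketch runs offline.
stub = types.ModuleType("helper_stub")


class _StubUtils:
    @staticmethod
    def getHTML(url=None):
        return "<html>" + url + "</html>"


stub.Utils = _StubUtils
sys.modules["helper_stub"] = stub


def call_get_html(module_name, url):
    """Mirror the steps main.c performs through the C API."""
    if "." not in sys.path:
        sys.path.append(".")                       # PyList_Append(path, ".")
    module = importlib.import_module(module_name)  # PyImport_Import
    cls = vars(module).get("Utils")                # PyModule_GetDict + PyDict_GetItemString
    if cls is None or not callable(cls):           # PyCallable_Check
        raise RuntimeError("Utils class not found")
    obj = cls()                                    # PyObject_CallObject(pClass, NULL)
    html = obj.getHTML(url)                        # PyObject_CallMethod(..., "getHTML", "(s)", url)
    if html is None:                               # getHTML signals failure with None
        return None
    return html.encode("utf-8")                    # PyUnicode_AsUTF8String


print(call_get_html("helper_stub", "https://curl.se/libcurl/"))
```

With the real helper.py on `sys.path`, passing `"helper"` as the module name would follow the same path as the C program, including the network request.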
helper.py
import ssl
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


class Utils:
    __slots__ = ()

    @staticmethod
    def getHTML(url=None):
        """Return the page at url as a str, or None if the request fails."""
        if url is None:
            raise ValueError("Enter url.")
        # Note: this context disables TLS certificate verification.
        context = ssl._create_unverified_context()
        try:
            with urlopen(url, timeout=5.0, context=context) as result:
                return result.read().decode("utf-8")
        except HTTPError:
            print("Seems like the server is down now.")
        except TimeoutError:
            print("It's taking too long to load website.")
        except URLError:
            print("Wrong url.")
        return None


if __name__ == "__main__":
    print(Utils.getHTML("https://www.programiz.com/python-programming/user-defined-exception"))
Result in C
Compile (substitute your own Python path)
chcp 65001 > NUL& gcc main.c -I E:\Python3.10\include -L E:\Python3.10\libs -l python310 -o main.exe
Printing char *html with printf gives the following.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <HTML> <HEAD> <TITLE>libcurl - the multiprotocol file transfer library</TITLE> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <meta content="text/html; charset=UTF-8" http-equiv="Content-Type"> <link rel="stylesheet" type="text/css" href="https://curl.se/curl.css"> <link rel="shortcut icon" href="https://curl.se/favicon.ico"> <link rel="icon" href="https://curl.se/logo/curl-symbol.svg" type="image/svg+xml"> <meta name="description" content="libcurl is a free and easy-to-use client-side URL transfer library, supporting DICT, FILE, FTP, FTPS, GOPHER, GOPHERS, HTTP, HTTPS, IMAP, IMAPS, LDAP, LDAPS, MQTT, POP3, POP3S, RTMP, RTMPS, RTSP, SCP, SFTP, SMB, SMBS, SMTP, SMTPS, TELNET and TFTP. libcurl supports SSL certificates, HTTP POST, HTTP PUT, FTP uploading, kerberos, HTTP form based upload, proxies, cookies, user+password authentication, file transfer resume, http proxy tunneling and more"> </HEAD> <body bgcolor="#ffffff" text="#000000"> <div class="main"> <div class="menu"> <a href="" class="itemselect">libcurl </a> <div class="dropdown"> <a class="dropbtn" href="https://curl.se/libcurl/">Docs</a> <div class="dropdown-content"> <a href="https://curl.se/libcurl/abi.html">ABI</a> <a href="https://curl.se/libcurl/c/">API</a> <a href="https://curl.se/libcurl/bindings.html">Bindings</a> <a href="https://curl.se/libcurl/competitors.html">Competitors</a> <a href="https://curl.se/libcurl/c/example.html">Examples</a> <a href="https://curl.se/libcurl/features.html">Features</a> <a href="https://curl.se/libcurl//mail/list.cgi?list=curl-library.html">Mailist list</a> <a href="https://curl.se/libcurl/relatedlibs.html">Related libs</a> <a href="https://curl.se/libcurl/using/">Using libcurl</a> <a href="https://curl.se/libcurl/security.html">Security</a> <a href="https://curl.se/libcurl/c/tutorial.html">Tutorial</a> <a 
href="https://curl.se/libcurl/theysay.html">Testimonials</a> </div> </div> </div> <div class="contents"> <div class="where"><a href="https://curl.se/">curl</a> / <b>libcurl overview</b></div> <h1> libcurl - the multiprotocol file transfer library </h1> <p> libcurl is a <a href="/docs/copyright.html" title="free as in both free speech and zero price!">free</a> and easy-to-use client-side URL transfer library, supporting DICT, FILE, FTP, FTPS, GOPHER, GOPHERS, HTTP, HTTPS, IMAP, IMAPS, LDAP, LDAPS, MQTT, POP3, POP3S, RTMP, RTMPS, RTSP, SCP, SFTP, SMB, SMBS, SMTP, SMTPS, TELNET and TFTP. libcurl supports SSL certificates, HTTP POST, HTTP PUT, FTP uploading, HTTP form based upload, proxies, HTTP/2, HTTP/3, cookies, user+password authentication (Basic, Digest, NTLM, Negotiate, Kerberos), file transfer resume, http proxy tunneling and more! </p> <p> libcurl is highly portable, it builds and works identically on numerous platforms, including Solaris, NetBSD, FreeBSD, OpenBSD, Darwin, HPUX, IRIX, AIX, Tru64, Linux, UnixWare, HURD, Windows, Amiga, OS/2, BeOs, Mac OS X, Ultrix, QNX, OpenVMS, RISC OS, Novell NetWare, DOS and more... </p> <p> libcurl is <a href="/docs/copyright.html">free</a>, <a href="features.html#thread">thread-safe</a>, <a href="features.html#ipv6">IPv6 compatible</a>, <a href="features.html#features">feature rich</a>, <a href="features.html#support">well supported</a>, <a href="features.html#fast">fast</a>, <a href="features.html#docs">thoroughly documented</a> and is already used by many known, big and successful <a href="/docs/companies.html">companies</a>. </p> <p><b>Download</b> <p> Go to the regular curl <a href="/download.html">download page</a> and get the latest curl package, or one of the specific libcurl packages listed. <p><b>API</b> <p> You use libcurl with the provided <a href="c/">C API</a>. The curl team works hard to keep the <a href="features.html#stableapi">API and ABI stable</a>. 
If you prefer using libcurl from your other favorite language, chances are there's already a <a href="bindings.html">binding</a> written for it. <p><b>Howto</b> <p> Check out our <a href="using/">using libcurl</a> page for general hints and advice, the <a href="competitors.html">free HTTP client library comparison</a>. or read the comparisons against <a href="libwww.html">libwww</a> and <a href="wininet.html">WinInet</a>. <p> libcurl is probably the most portable, most powerful and most often used network transfer library on this planet. </div> </div> <script defer src="https://www.fastly-insights.com/insights.js?k=8cb1247c-87c2-4af9-9229-768b1990f90b" type="text/javascript"></script> </BODY> </HTML>