  • C: Fetching HTML from a URL (Using Python from C)
    Computer/C & C++ 2021. 3. 8. 20:03

    libcurl

     

    libcurl - the multiprotocol file transfer library

    curl.se

     

    In C, HTTP requests can be made with libcurl.

     

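    However, when the goal is only to fetch HTML and hand it to the Modest engine or Google's gumbo parser,

    fetching the HTML from Python is simpler, so I tried doing that part through the CPython API.

    It also seems reasonable to rely on an HTML parser written in C, such as selectolax, and handle only the difficult parts in Python.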
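
As a rough illustration of handling the parsing step in Python, the sketch below pulls the page title and link targets out of an HTML string. It uses the standard library's html.parser instead of selectolax so that it runs without extra packages; the names TitleAndLinks and summarize are hypothetical and chosen only for this example.

```python
from html.parser import HTMLParser


class TitleAndLinks(HTMLParser):
    """Collects the <title> text and every href found on <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase, so <TITLE> also matches.
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(html):
    """Return (title, links) for an HTML string."""
    parser = TitleAndLinks()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links
```

The string returned by a fetch function such as getHTML could be passed straight to summarize, and only the small result would then need to cross back into C.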

     

    Python API from C

    We load the Utils class from helper.py and call Utils.getHTML(url).

    The code follows the CPython reference ownership rules.

    // main.c
    #include <stdio.h>
    #include <stdlib.h>
    #include <Python.h>
    
    int main() {
        // Owned references: sys, path, pName, pModule, pFunctionResult. Borrowed references: pDict, pClass.
        PyObject * sys = NULL, * path = NULL;
        PyObject * pName = NULL, * pModule = NULL, * pDict = NULL, * pClass = NULL, * pFunctionResult = NULL;
    
        Py_Initialize();
    
        // Tell Python where the modules are: add the current directory and your own site-packages path to sys.path.
        sys = PyImport_ImportModule("sys");
        if (!sys) {
            PyErr_Print();
            printf("Error importing sys module\n");
            goto exit;
        }
    
        path = PyObject_GetAttrString(sys, "path");
        if (!path) {
            PyErr_Print();
            printf("Error getting sys.path attribute\n");
            goto exit;
        }
    
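        // PyList_Append does not steal references, so release each temporary string after appending it.
        {
            PyObject * entry = PyUnicode_FromString(".");
            if (entry) {
                PyList_Append(path, entry);
                Py_DECREF(entry);
            }

            // THRID_PACKAGES holds the site-packages path; getenv returns NULL when it is unset.
            const char * thirdParty = getenv("THRID_PACKAGES");
            entry = thirdParty ? PyUnicode_FromString(thirdParty) : NULL;
            if (entry) {
                PyList_Append(path, entry);
                Py_DECREF(entry);
            }
        }
        Py_DECREF(path); // path is a new reference, so release it exactly once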
    
        // Import my helper.py
        pName = PyUnicode_FromString("helper");
        if (!pName) {
            PyErr_Print();
            printf("Error finding helper.py file in sys.path\n");
            goto exit;
        }
    
        // Import the module
        pModule = PyImport_Import(pName);
        Py_DECREF(pName);
        if (!pModule) {
            PyErr_Print();
            printf("Error importing python script.\n");
            goto exit;
        }
    
        pDict = PyModule_GetDict(pModule);
    
        // pClass is the Utils class from helper.py (a borrowed reference, so it is not DECREF'd)
        pClass = PyDict_GetItemString(pDict, "Utils");
        if (pClass && PyCallable_Check(pClass)) {
            PyObject * object = PyObject_CallObject(pClass, NULL);
            if (!object) {
                PyErr_Print();
                printf("Error calling Utils class\n");
                goto exit;
            }
    
            // Call the Python method getHTML
            pFunctionResult = PyObject_CallMethod(object, "getHTML", "(s)", "https://curl.se/libcurl/");
            Py_DECREF(object);
            if (!pFunctionResult) {
                PyErr_Print();
                goto exit;
            }
    
            PyObject * pHTML = PyUnicode_AsUTF8String(pFunctionResult);
            Py_DECREF(pFunctionResult);
            if (!pHTML) {
                PyErr_Print();
                goto exit;
            }
    
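            // PyBytes_AsString returns a pointer into pHTML's internal buffer,
            // so pHTML must stay alive until we are done with html.
            char * html = PyBytes_AsString(pHTML);
            if (html) {
                printf("\n%s\n", html); // the page is now available as a C string
            } else {
                PyErr_Print();
            }
            Py_DECREF(pHTML);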
        } else {
            printf("Error calling Utils class\n");
            PyErr_Print();
            goto exit;
        }
    
    exit:
        Py_XDECREF(sys);
        Py_XDECREF(pModule);
        Py_FinalizeEx();
    
        return 0;
    }
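
The ownership rules matter most around container calls. PyList_Append does not steal the caller's reference; the list takes its own, so the caller still has to release the one it created. The same behavior can be observed from Python, since list.append is the Python-level counterpart. The sketch below assumes CPython, and the function name refcount_delta_after_append is hypothetical.

```python
import sys


def refcount_delta_after_append():
    """Return how much an object's reference count grows when a list stores it."""
    item = object()
    container = []
    before = sys.getrefcount(item)
    container.append(item)  # the list acquires its own reference to item
    after = sys.getrefcount(item)
    return after - before
```

On CPython the difference is one reference, which is why a temporary string created in C only to be appended to sys.path still needs its own Py_DECREF afterwards.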

     

    helper.py

    import ssl
    
    from urllib.error import HTTPError, URLError
    from urllib.request import urlopen
    
    
    class Utils:
        __slots__ = ()
    
        @staticmethod
        def getHTML(url=None):
            if url is None:
                raise Exception("Enter url.")
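            # Note: this private helper disables TLS certificate verification.
            # It is convenient for a quick test, but ssl.create_default_context() is the verified alternative.
            context = ssl._create_unverified_context()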
            try:
                result = urlopen(url, timeout=5.0, context=context)
            except HTTPError as error:
                # HTTPError means the server answered with an error status (4xx/5xx).
                print(f"The server returned HTTP status {error.code}.")
                return None
            except TimeoutError:
                print("It's taking too long to load website.")
                return None
            except URLError:
                print("Wrong url.")
                return None
    
            html = result.read().decode("utf-8")
            return html
    
    if __name__ == "__main__":
        utils = Utils()
        utils.getHTML("https://www.programiz.com/python-programming/user-defined-exception")
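
getHTML decodes every page as UTF-8. One possible refinement, sketched here under the assumption that the server's Content-Type header names the charset, is to read the charset from that header and fall back to UTF-8 otherwise. decode_body is a hypothetical helper and is not part of the original helper.py.

```python
from email.message import Message


def decode_body(raw, content_type):
    """Decode response bytes using the charset named in a Content-Type header value."""
    header = Message()
    header["Content-Type"] = content_type
    charset = header.get_content_charset() or "utf-8"
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # The header named a codec Python does not know; fall back to UTF-8.
        return raw.decode("utf-8", errors="replace")
```

Inside getHTML this could be called as decode_body(result.read(), result.headers.get("Content-Type", "text/html")).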
    

     

    C Output

    Compile (substitute your own Python path)

    chcp 65001 > NUL& gcc main.c -I E:\Python3.10\include -L E:\Python3.10\libs -l python310 -o main.exe
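
If the include and library directories of the local installation are not known, Python itself can report them. The sketch below uses sysconfig for the header directory. The libs location is an assumption based on the usual Windows layout (a libs folder under the installation prefix), and python_build_paths is a hypothetical name.

```python
import os
import sys
import sysconfig


def python_build_paths():
    """Return (include_dir, libs_dir) candidates for the -I and -L compiler flags."""
    include_dir = sysconfig.get_paths()["include"]
    # Assumption: Windows installers place the import libraries in <prefix>\libs.
    libs_dir = os.path.join(sys.base_prefix, "libs")
    return include_dir, libs_dir


if __name__ == "__main__":
    include_dir, libs_dir = python_build_paths()
    print("-I", include_dir)
    print("-L", libs_dir)
```

The printed directories can then be substituted for the E:\Python3.10 paths in the gcc command.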

    Printing char *html with printf gives the following.

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <HTML>
    <HEAD> <TITLE>libcurl - the multiprotocol file transfer library</TITLE>
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
    <link rel="stylesheet" type="text/css" href="https://curl.se/curl.css">
    <link rel="shortcut icon" href="https://curl.se/favicon.ico">
    <link rel="icon" href="https://curl.se/logo/curl-symbol.svg" type="image/svg+xml">
    <meta name="description" content="libcurl is a free and easy-to-use
     client-side URL transfer library, supporting DICT, FILE, FTP, FTPS,
     GOPHER, GOPHERS, HTTP, HTTPS, IMAP, IMAPS, LDAP, LDAPS, MQTT, POP3, POP3S,
     RTMP, RTMPS, RTSP, SCP, SFTP, SMB, SMBS, SMTP, SMTPS, TELNET and
     TFTP. libcurl supports SSL certificates, HTTP POST, HTTP PUT, FTP
     uploading, kerberos, HTTP form based upload, proxies, cookies,
     user+password authentication, file transfer resume, http proxy tunneling
     and more">
    </HEAD>
    <body bgcolor="#ffffff" text="#000000">
    <div class="main">
    <div class="menu">
    <a href="" class="itemselect">libcurl </a>
    <div class="dropdown">
      <a class="dropbtn" href="https://curl.se/libcurl/">Docs</a>
      <div class="dropdown-content">
        <a href="https://curl.se/libcurl/abi.html">ABI</a>
        <a href="https://curl.se/libcurl/c/">API</a>
        <a href="https://curl.se/libcurl/bindings.html">Bindings</a>
        <a href="https://curl.se/libcurl/competitors.html">Competitors</a>
        <a href="https://curl.se/libcurl/c/example.html">Examples</a>
        <a href="https://curl.se/libcurl/features.html">Features</a>
        <a href="https://curl.se/libcurl//mail/list.cgi?list=curl-library.html">Mailist list</a>
        <a href="https://curl.se/libcurl/relatedlibs.html">Related libs</a>
        <a href="https://curl.se/libcurl/using/">Using libcurl</a>
        <a href="https://curl.se/libcurl/security.html">Security</a>
        <a href="https://curl.se/libcurl/c/tutorial.html">Tutorial</a>
        <a href="https://curl.se/libcurl/theysay.html">Testimonials</a>
      </div>
    </div>
    </div>
    <div class="contents">
    <div class="where"><a href="https://curl.se/">curl</a> / <b>libcurl overview</b></div>
    <h1> libcurl - the multiprotocol file transfer library </h1>
    <p>
     libcurl is a <a href="/docs/copyright.html" title="free as in both free
     speech and zero price!">free</a> and easy-to-use client-side URL transfer
     library, supporting DICT, FILE, FTP, FTPS, GOPHER, GOPHERS, HTTP, HTTPS,
     IMAP, IMAPS, LDAP, LDAPS, MQTT, POP3, POP3S, RTMP, RTMPS, RTSP, SCP, SFTP,
     SMB, SMBS, SMTP, SMTPS, TELNET and TFTP. libcurl supports SSL
     certificates, HTTP POST, HTTP PUT, FTP uploading, HTTP form based upload,
     proxies, HTTP/2, HTTP/3, cookies, user+password authentication (Basic,
     Digest, NTLM, Negotiate, Kerberos), file transfer resume, http proxy
     tunneling and more!
    </p>
    <p>
     libcurl is highly portable, it builds and works identically on numerous
     platforms, including Solaris, NetBSD, FreeBSD, OpenBSD, Darwin, HPUX, IRIX,
     AIX, Tru64, Linux, UnixWare, HURD, Windows, Amiga, OS/2, BeOs, Mac OS X,
     Ultrix, QNX, OpenVMS, RISC OS, Novell NetWare, DOS and more...
    </p>
    <p>
     libcurl is <a href="/docs/copyright.html">free</a>, <a
     href="features.html#thread">thread-safe</a>, <a
     href="features.html#ipv6">IPv6 compatible</a>, <a
     href="features.html#features">feature rich</a>, <a
     href="features.html#support">well supported</a>, <a
     href="features.html#fast">fast</a>, <a href="features.html#docs">thoroughly
     documented</a> and is already used by many known, big and successful <a
     href="/docs/companies.html">companies</a>.
    </p>
    <p><b>Download</b>
    <p>
     Go to the regular curl <a href="/download.html">download page</a> and get the
     latest curl package, or one of the specific libcurl packages listed.
    <p><b>API</b>
    <p>
     You use libcurl with the provided <a href="c/">C API</a>. The curl team works
     hard to keep the <a href="features.html#stableapi">API and ABI stable</a>.
     If you prefer using libcurl from your other favorite language, chances are
     there's already a <a href="bindings.html">binding</a> written for it.
    <p><b>Howto</b>
    <p>
     Check out our <a href="using/">using libcurl</a> page for general hints and
     advice, the <a href="competitors.html">free HTTP client library
     comparison</a>.  or read the comparisons against <a
     href="libwww.html">libwww</a> and <a href="wininet.html">WinInet</a>.
    <p>
     libcurl is probably the most portable, most powerful and most often used
     network transfer library on this planet.
    </div>
    </div>
    <script defer src="https://www.fastly-insights.com/insights.js?k=8cb1247c-87c2-4af9-9229-768b1990f90b" type="text/javascript"></script>
    </BODY>
    </HTML>

     

    References

    See the full source

     

    Alfex4936/C-Studies

    Studying the Kore web framework in C (planning to try a Kakao chatbot). Contribute to Alfex4936/C-Studies development by creating an account on GitHub.

    github.com

     
