MySQL: 주변 검색 Spatial 함수 탐구

컴퓨터/MySQL 2024. 3. 6. 16:24

728x90

공간 데이터 처리

카카오맵 API를 이용하여 DB에 저장된 위치 주변 데이터들을 불러오고 싶었다.

Postgres + postgis를 사용하면 ST_DWithin 이란 함수를 통해 쉽게 주변 장소들을 불러올 수 있다.

https://www.ibm.com/docs/ko/db2/11.5?topic=sf-st-dwithin

ST_DWithin

ST_DWithin 함수는 두 Geometry가 지정된 거리 내에 있는지 판별합니다. 구문 매개변수 geometry1 geometry2와 비교할 Geometry를 나타내는 ST_Geometry 유형의 값입니다. geometry2 geometry1과 비교할 Geometry를 표시하

www.ibm.com

두 함수 모두 지리적 객체 간의 거리를 계산되는데 쓰지만 최적화가 다르다.

PostgreSQL로 migration 하는 것도 좋겠지만, MySQL에서 할 수 있는 방안들을 찾아보았다.

ST_DWithin

MySQL과 같이 R-Tree 기반 이지만 GiST 인덱싱이다.

먼저 인덱스를 사용해 후보 객체들을 선별하고, 실제 거리 계산은 필터링된 객체들만 수행한다.

ST_DWithin(geometry g1, geometry g2, double precision distance);

https://github.com/postgis/postgis/blob/376d4ec7caf5b3772a8615f7b44b880bf4942ef8/liblwgeom/measures.c#L207

double
lwgeom_mindistance2d_tolerance(const LWGEOM *lw1, const LWGEOM *lw2, double tolerance)
{
	DISTPTS thedl;
	LWDEBUG(2, "lwgeom_mindistance2d_tolerance is called");
	thedl.mode = DIST_MIN;
	thedl.distance = FLT_MAX;
	thedl.tolerance = tolerance;
	if (lw_dist2d_comp(lw1, lw2, &thedl))
		return thedl.distance;
	/*should never get here. all cases ought to be error handled earlier*/
	lwerror("Some unspecified error.");
	return FLT_MAX;
}


int
lw_dist2d_comp(const LWGEOM *lw1, const LWGEOM *lw2, DISTPTS *dl)
{
	return lw_dist2d_recursive(lw1, lw2, dl);
}

/**
This is a recursive function delivering every possible combination of subgeometries
*/
int
lw_dist2d_recursive(const LWGEOM *lwg1, const LWGEOM *lwg2, DISTPTS *dl)
{
	int i, j;
	int n1 = 1;
	int n2 = 1;
	LWGEOM *g1 = NULL;
	LWGEOM *g2 = NULL;
	LWCOLLECTION *c1 = NULL;
	LWCOLLECTION *c2 = NULL;

	LWDEBUGF(2, "lw_dist2d_comp is called with type1=%d, type2=%d", lwg1->type, lwg2->type);

	if (lw_dist2d_is_collection(lwg1))
	{
		LWDEBUG(3, "First geometry is collection");
		c1 = lwgeom_as_lwcollection(lwg1);
		n1 = c1->ngeoms;
	}
	if (lw_dist2d_is_collection(lwg2))
	{
		LWDEBUG(3, "Second geometry is collection");
		c2 = lwgeom_as_lwcollection(lwg2);
		n2 = c2->ngeoms;
	}

	for (i = 0; i < n1; i++)
	{

		if (lw_dist2d_is_collection(lwg1))
			g1 = c1->geoms[i];
		else
			g1 = (LWGEOM *)lwg1;

		if (lwgeom_is_empty(g1))
			continue;

		if (lw_dist2d_is_collection(g1))
		{
			LWDEBUG(3, "Found collection inside first geometry collection, recursing");
			if (!lw_dist2d_recursive(g1, lwg2, dl))
				return LW_FALSE;
			continue;
		}
		for (j = 0; j < n2; j++)
		{
			if (lw_dist2d_is_collection(lwg2))
				g2 = c2->geoms[j];
			else
				g2 = (LWGEOM *)lwg2;

			if (lw_dist2d_is_collection(g2))
			{
				LWDEBUG(3, "Found collection inside second geometry collection, recursing");
				if (!lw_dist2d_recursive(g1, g2, dl))
					return LW_FALSE;
				continue;
			}

			if (!g1->bbox)
				lwgeom_add_bbox(g1);

			if (!g2->bbox)
				lwgeom_add_bbox(g2);

			/* If one of geometries is empty, skip */
			if (lwgeom_is_empty(g1) || lwgeom_is_empty(g2))
				continue;

			if ((dl->mode != DIST_MAX) && (!lw_dist2d_check_overlap(g1, g2)) &&
			    (g1->type == LINETYPE || g1->type == POLYGONTYPE || g1->type == TRIANGLETYPE) &&
			    (g2->type == LINETYPE || g2->type == POLYGONTYPE || g2->type == TRIANGLETYPE))
			{
				if (!lw_dist2d_distribute_fast(g1, g2, dl))
					return LW_FALSE;
			}
			else
			{
				if (!lw_dist2d_distribute_bruteforce(g1, g2, dl))
					return LW_FALSE;
				if (dl->distance <= dl->tolerance && dl->mode == DIST_MIN)
					return LW_TRUE; /*just a check if the answer is already given*/
			}
		}
	}
	return LW_TRUE;
}

코드가 길지만, 재귀적으로 가능한 모든 subgeometry 조합을 통해 거리를 계산하고 있다.

또 bounding box를 확인하고 없으면 추가해주고, 겹치지 않으면 lw_dist2d_distribute_fast 를 부르고 있다.

아니면 brute force 하게 계산을 하는 것을 알 수 있었다.

지리적 데이터 타입에 대해선 구면 측정을 기본으로 하는 듯하고, 카르테시안 거리를 쓰는 것 같다.

d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}

ST_Distance_Sphere

모든 공간 함수가 인덱스를 효율적으로 활용하지 못할 수도 있다.

두 지점 간의 최소 구면 (sphere) 거리를 계산하고, 인덱스를 활용하여 사전에 거리 계산 대상을 제한하지 않는다.

그래서 대규모 데이터셋에서는 성능 저하가 있을 수도 있다.

ST_Distance_Sphere(point p1, point p2);

https://dev.mysql.com/doc/refman/8.0/en/spatial-convenience-functions.html#function_st-distance-sphere

MySQL :: MySQL 8.0 Reference Manual :: 14.16.13 Spatial Convenience Functions

MySQL 8.0 Reference Manual / ... / Functions and Operators / Spatial Analysis Functions / Spatial Convenience Functions 14.16.13 Spatial Convenience Functions The functions in this section provide convenience operations on geometry values.

dev.mysql.com

SRID(공간 참조 식별자, 주로 4326)이 0 이면, 카르테시안 공간 참조 시스템에 대해 구현되지 않았다는 오류가 발생한다고 적혀있다.

SRID를 이용하면 Point(경도, 위도) 가 아니고 Point(위도, 경도)로 처리해야한다.

TABLE

SRID + 인덱싱을 MySQL 테이블을 만들었다.

CREATE TABLE Markers (
    MarkerID INT AUTO_INCREMENT PRIMARY KEY,
    UserID INT NULL,
    Location POINT NOT NULL SRID 4326, -- SRID
    Description VARCHAR(255),
    CreatedAt TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UpdatedAt TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    ...
    SPATIAL INDEX(Location), --- 인덱싱
    FOREIGN KEY (UserID) REFERENCES Users(UserID) ON DELETE SET NULL
);

CREATE INDEX idx_markers_userId ON Markers(UserID);

SRID를 해주면 MySQL optimizer가 인덱스를 좀 더 효율적으로 쓸 수 있다고 한다. (read heavy 면 좋음)

보통은 SRID 4326 (WGS 84), 지구 평균을 쓴다고 한다.

MySQL ST_DWithin

PostgreSQL과 비슷하게 바운딩 박스를 쳐서 구해보고 싶었다.

그래서 공간 인데스를 활용할 수 있는 MBRContains + 바운딩 박스를 이용해

ST_Distance_Sphere 와 비교해봤다.

https://dev.mysql.com/doc/refman/8.0/en/spatial-relation-functions-mbr.html#function_mbrcontains

MySQL :: MySQL 8.0 Reference Manual :: 14.16.9.2 Spatial Relation Functions That Use Minimum Bounding Rectangles

14.16.9.2 Spatial Relation Functions That Use Minimum Bounding Rectangles MySQL provides several MySQL-specific functions that test the relationship between minimum bounding rectangles (MBRs) of two geometries g1 and g2. The return values 1 and 0 indicate

dev.mysql.com

SET @lat = 36.481550006080006;
SET @lon = 127.28920084353089;
SET @delta_lat = 7 / 111320; -- Approximate degrees per meter for latitude
SET @delta_lon = 7 / (111320 * COS(RADIANS(@lat))); -- Approximate degrees per meter for longitude

SET @min_lat = @lat - @delta_lat;
SET @max_lat = @lat + @delta_lat;
SET @min_lon = @lon - @delta_lon;
SET @max_lon = @lon + @delta_lon;

SELECT *
FROM Markers
WHERE MBRContains(
    ST_GeomFromText(CONCAT('POLYGON((', 
        @min_lat, ' ', @min_lon, ', ',
        @max_lat, ' ', @min_lon, ', ',
        @max_lat, ' ', @max_lon, ', ',
        @min_lat, ' ', @max_lon, ', ',
        @min_lat, ' ', @min_lon, '))'), 4326),
    Location
);

SET @lat = 36.4815;
SET @lon = 127.289;
SET @buffer_distance_meters = 10000;
SELECT * 
FROM Markers
WHERE ST_Distance_Sphere(Location, ST_GeomFromText(CONCAT('POINT(', @lat, ' ', @lon, ')'), 4326)) < @buffer_distance_meters;

or ST_Buffer 과 ST_Within

https://dev.mysql.com/doc/refman/8.0/en/spatial-operator-functions.html#function_st-buffer

MySQL :: MySQL 8.0 Reference Manual :: 14.16.8 Spatial Operator Functions

14.16.8 Spatial Operator Functions OpenGIS proposes a number of functions that can produce geometries. They are designed to implement spatial operators. These functions support all argument type combinations except those that are inapplicable according to

dev.mysql.com

SET @lat = 36.481550006080006;
SET @lon = 127.28920084353089;
SET @distance_meters = 10000;


-- Use ST_Within for checking within a certain distance for better performance
SELECT *
FROM Markers
WHERE ST_Within(
    Location,
    ST_Buffer(ST_GeomFromText(CONCAT('POINT(', @lat, ' ', @lon, ')'), 4326), @distance_meters)
);

ST_Within

ST_Within(A, B)` 함수는 첫 번째 지오메트리(A)가 두 번째 지오메트리(B) 내에 완전히 포함되어 있는지 여부를 검사한다. (즉, A가 B의 경계선 내부에 있는지 확인)

ST_Buffer

ST_Buffer(geom, radius)` 함수는 주어진 지오메트리(geom)를 중심으로 지정된 반지름(radius) 거리만큼 모든 방향으로 확장한 영역(다각형이나 라인)을 생성한다.

(주로 특정 지점으로부터 일정 거리 내에 있는 객체를 찾는 데 사용)

위 3 SQL을 비교하려면 "SET profiling = 1;" 하고 "SHOW PROFILES;" 해주면 된다.

평균적으로 1,3번 방법이 약 10배 정도 빨랐다.

Postgresql 15 + POSTGIS

똑같은 테이블을 만들고, 똑같이 인덱싱을 한 후 ST_DWithin과 비교해보았다.

-- Ensure the PostGIS extension is enabled
CREATE EXTENSION IF NOT EXISTS postgis;

-- Create the Markers table
CREATE TABLE Markers (
    MarkerID SERIAL PRIMARY KEY,
    Location GEOGRAPHY(POINT, 4326) NOT NULL,
    Description VARCHAR(255),
    CreatedAt TIMESTAMP WITHOUT TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    UpdatedAt TIMESTAMP WITHOUT TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

-- Create a GIST index on the Location column for fast spatial queries
CREATE INDEX idx_markers_location ON Markers USING GIST (Location);

SELECT *
FROM Markers
WHERE ST_DWithin(
    Location,
    ST_SetSRID(ST_MakePoint(long, lat), 4326),
    10000
);

근데 0.086초로 더 느리다. GEOGRAPHY인지 GEOMETRY인지도 잘 설정해야할 것 같고 설정을 더 만져봐야겠다.

분명 바운딩 박스도 만들어서 잘 검색하는데 Postgresql을 잘 몰라서 느린건지 잘 모르겠다.

참고

https://github.com/gazi-gazi/real-mysql/issues/33

[5.5.2장] R-Tree에서 Contains 또는 Intersect 함수 사용 · Issue #33 · gazi-gazi/real-mysql

p.240 여기서 Contains()나 Intersect() 연산은 사각형 박스와 같은 다각형(Polygon)으로만 연산할 수 있으므로... 요청 : Contains 예시를 보면 Contains(px, 사각상자) 이렇게 나오는데 사각상자에 어떤 방식으

github.com

728x90

저작자표시 비영리 변경금지 (새창열림)

'컴퓨터 > MySQL' 카테고리의 다른 글

Python: ModuleNotFoundError: No module named 'mysql' (0)	2020.12.31
Python MySQL: with OPEN_DB 만들기 (0)	2020.12.15
Python MySQL: 중복 row 인지 확인하기 (0)	2020.12.15

인기포스트 MORE POST

ABOUT ME

STUDY BITS STUDY BITS

공간 데이터 처리

ST_DWithin

ST_Distance_Sphere

TABLE

MySQL ST_DWithin

ST_Within

ST_Buffer

Postgresql 15 + POSTGIS

참고

'컴퓨터 > MySQL' 카테고리의 다른 글

티스토리툴바

인기포스트 MORE POST

ABOUT ME

공간 데이터 처리

ST_DWithin

ST_Distance_Sphere

TABLE

MySQL ST_DWithin

ST_Within

ST_Buffer

Postgresql 15 + POSTGIS

참고

'컴퓨터 > MySQL' 카테고리의 다른 글

관련글 관련글 더보기

티스토리툴바