YOLO, the Freak 1.

2026-05-20, 09:00 6 분 소요

시작하면서.

과제를 하면서 뭔가 YOLO에서 아이디어를 얻어야 하는 상황이 생겼다.
컴퓨터 비전/사이언스의 꽃 중 하나가 어떤 사물을 인식하는 기술이고, 그 기술 중 실시간성의 정점에 서 있는 것이 YOLO 다.
가볍게 접한다면 정말 가벼우면서도, 파고들기 시작하면 공부해야 할게 너무 많아지는 이 기술에 대하여, 내가 공부하면서 적어 내려가는 학습 노트다.

우선 돌려보자.

YOLO를 공부해봐야지 하고 학습 동영상을 검색하다보면, 생각보다 영상이 짧다는걸 알 수 있다.
이상하다 싶어서 다른 영상을 찾아봐도, 대부분 10~20분 사이의 영상들이 많음을 알 수 있다.
그만큼 YOLO 는 이미 구현과 완성이 잘 되어있어서, 정말 거의 딸깍 수준으로 사용하는데 큰 무리가 없다.

그러니 한번 해보고 이야기해보자.

우선 필수 패키지가 있는지 확인해보자.

(bash)
!pip install ultralytics opencv-python

이미처리를 해야 하니 OpenCV가 있는지 확인해야 할거고,
YOLO 라이브러리를 사용하기 위해 가장 유명한 업체인 Ultralytics 의 라이브러리를 가져와본다.

이제 파이썬에서 작업해보자.
모델을 불러와본다.

from ultralytics import YOLO
model = YOLO("yolo11n.pt")

끝이다.

좀 싱겁지 않나? 그만큼 라이브러리가 잘되어있다는 증거일게다.

그럼 이걸로 뭘 할 수 있나? 일단 쉽게 사진 하나를 넣어보자.

여기 이미지 이름을 my_son.png로 가정하자.

#이미지를 한번 화면에 출력하려한다. 
import matplotlib.pyplot as plt

#이미지를 불러온다. 아래 이미지파일의 경로는 원하는 이미지가 있는 경로를 쓰자. 
img = cv2.imread('./my_son.png',cv2.IMREAD_COLOR)

#이하 아래는 jupyer notebook 환경에서 출력하기 위해 작성한 코드다. 
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
plt.figure(figsize=(10, 14))
plt.imshow(img_rgb)
plt.axis("off")
plt.show()

그러면 이 이미지를 모델에 넣으면 어떤 결과가 나올까?

results = model(img, conf=0.25, iou=0.5)

출력창에 아래와 같은 내용이 뜰것이다.

0: 640x480 1 person, 1 snowboard, 3.0ms
Speed: 13.0ms preprocess, 3.0ms inference, 0.6ms postprocess per image at shape (1, 3, 640, 480)

딱히 GPU 지정하지 않았으므로, CPU만 썼기때문에 느릴것 같지만, 예측에는 3ms 밖에 사용하지 않았다. 추가적으로 13ms 는 이미지 리사이즈에 썼다. 그렇다. 이미지 리사이즈가 더 오래걸렸다…

출력값에는 어떤 것들이 저장되어있을까? 한번 results 를 한번 출력해보자.

print(results)


##아래는 결과값이다. 
[ultralytics.engine.results.Results object with attributes:

boxes: ultralytics.engine.results.Boxes object
keypoints: None
masks: None
names: {0: 'person', 1: 'bicycle', 2: 'car', 3: 'motorcycle', 4: 'airplane', 5: 'bus', 6: 'train', 7: 'truck', 8: 'boat', 9: 'traffic light', 10: 'fire hydrant', 11: 'stop sign', 12: 'parking meter', 13: 'bench', 14: 'bird', 15: 'cat', 16: 'dog', 17: 'horse', 18: 'sheep', 19: 'cow', 20: 'elephant', 21: 'bear', 22: 'zebra', 23: 'giraffe', 24: 'backpack', 25: 'umbrella', 26: 'handbag', 27: 'tie', 28: 'suitcase', 29: 'frisbee', 30: 'skis', 31: 'snowboard', 32: 'sports ball', 33: 'kite', 34: 'baseball bat', 35: 'baseball glove', 36: 'skateboard', 37: 'surfboard', 38: 'tennis racket', 39: 'bottle', 40: 'wine glass', 41: 'cup', 42: 'fork', 43: 'knife', 44: 'spoon', 45: 'bowl', 46: 'banana', 47: 'apple', 48: 'sandwich', 49: 'orange', 50: 'broccoli', 51: 'carrot', 52: 'hot dog', 53: 'pizza', 54: 'donut', 55: 'cake', 56: 'chair', 57: 'couch', 58: 'potted plant', 59: 'bed', 60: 'dining table', 61: 'toilet', 62: 'tv', 63: 'laptop', 64: 'mouse', 65: 'remote', 66: 'keyboard', 67: 'cell phone', 68: 'microwave', 69: 'oven', 70: 'toaster', 71: 'sink', 72: 'refrigerator', 73: 'book', 74: 'clock', 75: 'vase', 76: 'scissors', 77: 'teddy bear', 78: 'hair drier', 79: 'toothbrush'}
obb: None
orig_img: array([[[114, 127, 139],
        [124, 137, 149],
        [124, 137, 149],
        ...,
        [120, 137, 148],
        [127, 144, 155],
        [126, 143, 154]],

       [[121, 134, 146],
        [129, 142, 154],
        [129, 142, 154],
        ...,
        [133, 150, 161],
        [119, 136, 147],
        [121, 138, 149]],

       [[123, 136, 148],
        [121, 134, 146],
        [126, 139, 151],
        ...,
        [130, 147, 158],
        [115, 132, 143],
        [123, 140, 151]],

       ...,

       [[118, 152, 178],
        [106, 140, 166],
        [104, 138, 164],
        ...,
        [138, 140, 141],
        [145, 147, 148],
        [149, 151, 152]],

       [[101, 135, 161],
        [125, 159, 185],
        [114, 148, 174],
        ...,
        [149, 151, 151],
        [150, 152, 152],
        [148, 150, 150]],

       [[108, 142, 168],
        [136, 170, 196],
        [119, 153, 179],
        ...,
        [152, 154, 154],
        [151, 153, 153],
        [156, 158, 158]]], shape=(4000, 3000, 3), dtype=uint8)
orig_shape: (4000, 3000)
path: 'image0.jpg'
probs: None
save_dir: './runs/detect/predict' #이건 내가 편집한 경로임.
speed: {'preprocess': 13.026013970375061, 'inference': 2.971547655761242, 'postprocess': 0.616930890828371}]

하나하나 씹어보자.

boxes: ultralytics.engine.results.Boxes object

이건 모델이 객체를 검출하여 사각형 박스로 표현 가능하다는 말이다.

boxes 도 내부에 가지고 있는 instance들이 많다.

ultralytics.engine.results.Boxes object with attributes:

cls: tensor([ 0., 31.], device='cuda:0')
conf: tensor([0.9348, 0.5868], device='cuda:0')
data: tensor([[6.3112e+02, 1.4457e+03, 2.3588e+03, 3.9909e+03, 9.3475e-01, 0.0000e+00],
        [1.8425e+03, 1.0932e+03, 2.4676e+03, 2.4155e+03, 5.8681e-01, 3.1000e+01]], device='cuda:0')
id: None
is_track: False
orig_shape: (4000, 3000)
shape: torch.Size([2, 6])
xywh: tensor([[1494.9712, 2718.2510, 1727.7030, 2545.1985],
        [2155.0320, 1754.3097,  625.1235, 1322.2996]], device='cuda:0')
xywhn: tensor([[0.4983, 0.6796, 0.5759, 0.6363],
        [0.7183, 0.4386, 0.2084, 0.3306]], device='cuda:0')
xyxy: tensor([[ 631.1198, 1445.6516, 2358.8228, 3990.8501],
        [1842.4702, 1093.1599, 2467.5938, 2415.4595]], device='cuda:0')
xyxyn: tensor([[0.2104, 0.3614, 0.7863, 0.9977],
        [0.6142, 0.2733, 0.8225, 0.6039]], device='cuda:0')

흔히 많이 쓰는 값은

result.boxes.xyxy # [x1, y1, x2, y2] 좌표
result.boxes.xywh # [x_center, y_center, width, height]
result.boxes.conf # confidence score
result.boxes.cls # class id

들이다. 출력형태가 tensor라고 되어있지만, 사실 좌표값이니 걱정할 필요는 없는데, 이상하다. 소수점으로 되어있다. 이는 네트워크가 추정을 하면서 생기는 소수점이므로, 정확히 int로 떨어지지 않기때문이다. 걱정하지말고, 그냥 int로 바꾸자. 결국 이것때문에 필수적으로 오차가 발생한다고 봐도 무방하겠다.

xyxy: tensor([[ 631.1198, 1445.6516, 2358.8228, 3990.8501], [1842.4702, 1093.1599, 2467.5938, 2415.4595]], device=’cuda:0’)

자, 이제 YOLO에서 흔하게 알고 있는 그 그림. 추출된 객체에 Box를 그려보자. 이 박스를 정식용어로 Bounding Box라고 한다. 기억해두자.

result = results[0]
boxes = result.boxes.xyxy.cpu().numpy().astype(int)

img_draw = img_bgr.copy()

for (x1, y1, x2, y2) in boxes:
    cv2.rectangle(
        img_draw,
        (x1, y1),
        (x2, y2),
        (0, 255, 0),
        10
    )


img_rgb = cv2.cvtColor(img_draw, cv2.COLOR_BGR2RGB)

plt.figure(figsize=(10, 14))
plt.imshow(img_rgb)
plt.axis("off")
plt.show()

2026-05-20-14-55-30 — 아들아. 미안하다. 처음이 어렵지 두번은 쉽다.

어째되었건 객체가 인식되었다.

그럼 YOLO는 이 객체들을 뭐라고 분류했을까? 이어서 실행해보자.


#들여쓰기에 주의하자. 
for i in range(len(class_ids)):
  x1, y1, x2, y2 = boxes[i]
  conf = confs[i]
  class_id = class_ids[i]
  label = f"id:{class_id} {result.names[class_id]} {conf:.2f}"

  cv2.putText(
      img_draw,
      label,
      (x1, max(y1 - 12, 35)),
      cv2.FONT_HERSHEY_SIMPLEX,
      2,
      (0, 0, 0),
      3,
      cv2.LINE_AA,
  )


#여기는 이미지 출력에 대한 곳이므로, 이전과 같다. 
img_rgb = cv2.cvtColor(img_draw, cv2.COLOR_BGR2RGB)

plt.figure(figsize=(10, 14))
plt.imshow(img_rgb)
plt.axis("off")
plt.show()

이제 boxes 안에 있는 instance중 class id값과 confidence값이 나온다. class id는 어떤 모델로 분류했는가를 알려주는 것이고, confidence는 모델이 얼마나 확신을 가지고 검출했는지 알려준다.

어? 그럼 두 사물이 겹치는 건 어떻게 될까? 겹치는 영역에 대해선 모델별로 출력이 약간 다르므로, 그건 추후에 YOLO의 여러 버전들을 다루면서 이야기해보자.

2026-05-20-16-10-45 — 아들아. 그래도 넌 사람이구나. 근데 뒤의 티셔츠가 왜..?

검출된 사진을 자세히 보면, 뒤의 티셔츠가 스노우보드로 등록되어있다. confidence값도 0.59 정도로, 제법 높게 되어있음을 알 수 있다. 왜지??

이제 클래스 목록을 다시한번 들여다보자.
앞서서 results를 출력해본 적이 있다.

print(results)

[ultralytics.engine.results.Results object with attributes:

boxes: ultralytics.engine.results.Boxes object
keypoints: None
masks: None
names: {0: 'person', 1: 'bicycle', 2: 'car', 3: 'motorcycle', 4: 'airplane', 5: 'bus', 6: 'train', 7: 'truck', 8: 'boat', 9: 'traffic light', 10: 'fire hydrant', 11: 'stop sign', 12: 'parking meter', 13: 'bench', 14: 'bird', 15: 'cat', 16: 'dog', 17: 'horse', 18: 'sheep', 19: 'cow', 20: 'elephant', 21: 'bear', 22: 'zebra', 23: 'giraffe', 24: 'backpack', 25: 'umbrella', 26: 'handbag', 27: 'tie', 28: 'suitcase', 29: 'frisbee', 30: 'skis', 31: 'snowboard', 32: 'sports ball', 33: 'kite', 34: 'baseball bat', 35: 'baseball glove', 36: 'skateboard', 37: 'surfboard', 38: 'tennis racket', 39: 'bottle', 40: 'wine glass', 41: 'cup', 42: 'fork', 43: 'knife', 44: 'spoon', 45: 'bowl', 46: 'banana', 47: 'apple', 48: 'sandwich', 49: 'orange', 50: 'broccoli', 51: 'carrot', 52: 'hot dog', 53: 'pizza', 54: 'donut', 55: 'cake', 56: 'chair', 57: 'couch', 58: 'potted plant', 59: 'bed', 60: 'dining table', 61: 'toilet', 62: 'tv', 63: 'laptop', 64: 'mouse', 65: 'remote', 66: 'keyboard', 67: 'cell phone', 68: 'microwave', 69: 'oven', 70: 'toaster', 71: 'sink', 72: 'refrigerator', 73: 'book', 74: 'clock', 75: 'vase', 76: 'scissors', 77: 'teddy bear', 78: 'hair drier', 79: 'toothbrush'}

이 names가 바로 COCO 클래스 목록인데 여기 가만히 보면 티셔츠가 없다……
즉 여기 없으면 분류해내지 못한단 소리다. 그래서 가장 비슷한 snow보드라는 결론을 뱉어 낸 것이다.

그래서, YOLO를 통해 특정 개체를 좀더 잘 찾아내고 싶다면 Fine-tuning을 거쳐 해당 클래스를 집어 넣고 학습하는 과정을 거쳐야 한다.

이제 그러면 학습하는 과정에 대해 좀 더 알아보도록 하자.

오늘의 요약

YOLO 모델 불러오는 방법
YOLO모델로 예측하기.
Bounding box 그려보기

Twitter Facebook LinkedIn

송영국

YOLO, the Freak 1.

시작하면서.

우선 돌려보자.

오늘의 요약

공유하기

댓글남기기