[Day4] "한 권으로 LLM" Online Study, Cohort 1 - Fine-Tuning Concepts

31weeks 2025. 1. 25. 17:37

3.1 Preparing Data for Full Fine-Tuning

3.1.1 Principles and Types of Full Fine-Tuning

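What is fine-tuning?
- Further training an already trained, publicly released language model (Pre-trained Language Model, PLM) for a specific task
- The model is trained further on data from the domain where performance should improve, or from the problem to be solved
→ The model can then generate more accurate and reliable responses in that domain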
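Why fine-tune
- Far more economical and convenient than developing a model from scratch
- Data for a specific domain is usually very scarce → a model trained only on that data tends to overfit and lacks natural language generation ability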
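Types of fine-tuning
- PEFT (Parameter-Efficient Fine-Tuning)
a. Adapter Tuning: adds small new neural network modules to the existing large language model
b. Prompt Tuning: adds special instructions to the model's input, typically as trainable prompt vectors (soft prompts) rather than hand-written text
c. LoRA (Low-Rank Adaptation): adapts the model efficiently by adding the product of two small matrices to the original weight matrix → an efficient way to save compute resources and time, especially when fine-tuning large models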
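The LoRA idea above (adding the product of two small matrices to the original matrix) can be sketched in plain Python. This is only an illustration under assumptions that do not come from these notes: the names `W`, `A`, `B`, `r`, and `alpha`, the `alpha / r` scaling, and the zero initialization of `B` follow the convention commonly used for LoRA, and the tiny matrix sizes are arbitrary.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, b, a, alpha, r):
    """Return W + (alpha / r) * (B @ A), the weight actually used in the forward pass."""
    scale = alpha / r
    delta = matmul(b, a)  # low-rank update, same shape as W
    return [[wv + scale * dv for wv, dv in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

random.seed(0)
d_out, d_in, r, alpha = 6, 8, 2, 4  # illustrative sizes (assumption)

# W is the frozen pre-trained weight; only A and B would be trained.
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]  # r x d_in
B = [[0.0] * r for _ in range(d_out)]                                 # d_out x r, starts at zero

W_eff = lora_effective_weight(W, B, A, alpha, r)

full_params = d_out * d_in        # parameters updated by full fine-tuning
lora_params = r * (d_in + d_out)  # parameters updated by LoRA
print(full_params, lora_params)
```

Because `B` starts at zero, the effective weight equals `W` before any training, so the model's behavior is unchanged at the start. The saving grows with matrix size: for a 4096 x 4096 matrix and `r = 8`, LoRA would train 65,536 values instead of about 16.8 million.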
Points to watch when fine-tuning
- Overfitting
- Catastrophic forgetting
- Enormous compute resources and time are required (for models of 7B parameters or more)
- Data quality and quantity (Garbage In, Garbage Out)
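To give a feel for the compute cost mentioned above, here is a rough back-of-the-envelope estimate of the memory needed just to hold the training state of a 7B model. The byte counts are assumptions not taken from these notes: mixed-precision training with Adam is often estimated at about 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights plus the two Adam moment buffers). Activations and framework overhead are ignored, so real usage would be higher.

```python
def full_finetune_memory_gb(num_params, bytes_weights=2, bytes_grads=2, bytes_optimizer=12):
    """Estimate training-state memory in GB (1 GB = 1e9 bytes), excluding activations.

    The default byte counts are assumptions for mixed-precision Adam training.
    """
    total_bytes = num_params * (bytes_weights + bytes_grads + bytes_optimizer)
    return total_bytes / 1e9

estimate = full_finetune_memory_gb(7e9)  # a hypothetical 7B-parameter model
print(f"{estimate:.0f} GB")
```

Under these assumptions the estimate exceeds the memory of a single typical GPU, which is one reason parameter-efficient methods such as LoRA are attractive.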

3.1.2 Various Datasets

Next-token generation / text generation
- English data
https://commoncrawl.org/overview
https://skylion007.github.io/OpenWebTextCorpus/
- Korean data (mC4 is multilingual and includes a Korean subset)
https://huggingface.co/datasets/legacy-datasets/mc4
Dialogue tasks
- English data
https://github.com/budzianowski/multiwoz
https://parl.ai/projects/convai2/
https://parl.ai/docs/tasks.html#daily-dialog
https://huggingface.co/datasets/AlekseyKorshuk/persona-chat
- Korean data
https://www.aihub.or.kr/aihubdata/data/list.do?pageIndex=1&currMenu=&topMenu=&dataSetSn=&srchdataClCode=DATACL001&srchDataRealmCode=REALM002&srchDataTy=DATA003&searchKeyword=%EB%8C%80%ED%99%94&srchDetailCnd=DETAILCND001&srchOrder=ORDER001&srchPagePer=20
Question answering
- English data
https://huggingface.co/datasets/rajpurkar/squad
https://huggingface.co/datasets/google-research-datasets/natural_questions
https://huggingface.co/datasets/mandarjoshi/trivia_qa
- Korean data
https://huggingface.co/datasets/lmqg/qg_koquad
Summarization
- English data
https://huggingface.co/datasets/abisee/cnn_dailymail
https://huggingface.co/datasets/EdinburghNLP/xsum
- Korean data
https://www.aihub.or.kr/aihubdata/data/list.do?pageIndex=1&currMenu=&topMenu=&dataSetSn=&srchdataClCode=DATACL001&srchDataRealmCode=REALM002&srchDataTy=DATA003&searchKeyword=%EC%9A%94%EC%95%BD&srchDetailCnd=DETAILCND001&srchOrder=ORDER001&srchPagePer=20
Machine translation
- Korean-English translation data
https://ko-nlp.github.io/Korpora/en-docs/corpuslist/aihub_translation.html
https://ko-nlp.github.io/Korpora/ko-docs/corpuslist/aihub_translation.html
Paraphrasing
- English data
https://huggingface.co/datasets/google-research-datasets/paws
https://huggingface.co/datasets/AlekseyKorshuk/quora-question-pairs
- Korean data
https://huggingface.co/datasets/ohgnues/korean-qa-paraphrase
Code generation and understanding
- Data
https://huggingface.co/datasets?sort=downloads&search=CodeSearchNet
Retrieval-augmented generation (RAG)
- English data
https://huggingface.co/datasets/wikimedia/wikipedia/viewer/20231101.en
https://huggingface.co/datasets/neural-bridge/rag-dataset-12000
- Korean data
https://huggingface.co/datasets/wikimedia/wikipedia/viewer/20231101.ko
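Whichever dataset is chosen, its records usually have to be reshaped into input/output text pairs before fine-tuning. The sketch below does this for one question-answering record. The field names (`context`, `question`, `answers.text`) mirror the SQuAD layout, but the sample record, the prompt template, and the `format_qa_example` helper are hypothetical and only for illustration.

```python
def format_qa_example(record):
    """Turn one SQuAD-style record into a prompt/completion pair."""
    answers = record["answers"]["text"]
    answer = answers[0] if answers else ""  # some records may have no answer
    prompt = (
        f"Context: {record['context']}\n"
        f"Question: {record['question']}\n"
        "Answer:"
    )
    return {"prompt": prompt, "completion": " " + answer}

# Hypothetical record written by hand, not taken from any real dataset.
sample = {
    "context": "LoRA adds the product of two small matrices to a frozen weight matrix.",
    "question": "What does LoRA add to the frozen weight matrix?",
    "answers": {"text": ["the product of two small matrices"], "answer_start": [10]},
}

pair = format_qa_example(sample)
print(pair["prompt"])
print(pair["completion"])
```

The same pattern would apply to summarization or translation data, with the template changed to match the task.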

3.1.3 Data Preprocessing

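Text preprocessing: converting raw text into a form that a computer can analyze more easily
Data cleansing: finding incorrect or incomplete parts and fixing or removing them
Tokenization: splitting text into small, meaningful units
Normalization: converting text into a consistent format that is easier to use for AI training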
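The three steps above can be sketched with the Python standard library alone. This is a simplified illustration, not the pipeline used in the book: the specific cleaning rules (dropping HTML tags and URLs), the NFKC and lowercase normalization, and the word-level tokenizer are all assumptions. Real LLM pipelines typically use a trained subword tokenizer instead.

```python
import re
import unicodedata

def cleanse(text):
    """Data cleansing: remove HTML tags and URLs (illustrative rules)."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"https?://\S+", " ", text)

def normalize(text):
    """Normalization: unify Unicode forms, letter case, and whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    """Tokenization: split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def preprocess(text):
    """Run cleansing, normalization, and tokenization in order."""
    return tokenize(normalize(cleanse(text)))

tokens = preprocess("<p>Fine-Tuning  is   GREAT!</p> See https://example.com")
print(tokens)  # ['fine', '-', 'tuning', 'is', 'great', '!', 'see']
```

Cleansing runs first so that markup and links do not leak into the tokens, and normalization runs before tokenization so that the same word always maps to the same token.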

License: Attribution, NonCommercial, NoDerivatives
