sklearn - Spam Message Classification (spam-text-message-classification)

Spam Text dataset

https://www.kaggle.com/datasets/team-ai/spam-text-message-classification

Write-up

Import the libraries needed for data handling

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

Load the SPAM Text dataset and preview the data

df = pd.read_csv("/kaggle/input/spam-text-message-classification/SPAM text message 20170820 - Data.csv")
df.head()

The preview table has two columns, Category and Message. A message is labeled spam if its content is spam and ham otherwise.

Assign the messages to the variable X and the categories to the variable y.

X = df['Message']
y = df['Category']
len(X)
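
Besides the number of rows, it can help to look at how many messages fall under each label, since spam datasets are often imbalanced. The following is a minimal sketch on a hypothetical toy DataFrame (toy_df and its four messages are made up for illustration and only mirror the Category and Message columns); the same two calls could be applied to df.

```python
import pandas as pd

# Hypothetical toy data with the same two columns as the Kaggle CSV.
toy_df = pd.DataFrame({
    "Category": ["ham", "ham", "ham", "spam"],
    "Message": [
        "see you at lunch",
        "call me when you get home",
        "thanks for the notes",
        "you won a free prize, reply now",
    ],
})

# Absolute and relative frequency of each label.
counts = toy_df["Category"].value_counts()
ratios = toy_df["Category"].value_counts(normalize=True)
print(counts)
print(ratios)
```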

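train_test_split(): a data-splitting function that plays an important role in training and evaluating machine learning models.

It returns the data used for training (X_train, y_train) and the data used for testing (X_test, y_test).

The split is random by default; passing a fixed random_state (22 below) makes it reproducible.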

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)
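
The effect of random_state can be checked on a small example. The sketch below uses hypothetical toy lists (X_toy, y_toy) rather than the real dataset, and also shows the optional stratify argument, which the code above does not use but which keeps the ham/spam ratio the same in both parts.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 messages, 8 ham and 2 spam.
X_toy = [f"message {i}" for i in range(10)]
y_toy = ["ham"] * 8 + ["spam"] * 2

# Same random_state -> identical split on every call.
split_a = train_test_split(X_toy, y_toy, test_size=0.5, random_state=22)
split_b = train_test_split(X_toy, y_toy, test_size=0.5, random_state=22)
same_split = split_a[1] == split_b[1]  # compare the two X_test lists
print(same_split)

# stratify keeps the ham/spam ratio the same in both halves.
_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=22, stratify=y_toy
)
print(y_tr.count("spam"), y_te.count("spam"))
```

With only a small share of spam messages, an unstratified split can leave the test set with very few spam examples, so stratify may be worth considering for this kind of data.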

Next, import the libraries needed for machine learning

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

Pipeline: chains the steps from data preprocessing through training into a single object

https://medium.com/geekculture/how-to-use-sklearn-pipelines-to-simplify-machine-learning-workflow-bde1cebb9fa2

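Pipeline structure

Pipeline takes a list of two-element tuples as its argument.

Pipeline([('step_name_1', step_object_1), ('step_name_2', step_object_2)])

Training is possible without a pipeline, but then each step (feature selection, standardization, model training) has to be coded one by one, which gets complicated. With a pipeline, the whole sequence is handled in one go.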
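
To make the comparison concrete, here is a minimal sketch that runs the same two steps by hand and through a Pipeline. The toy texts and labels are hypothetical and chosen only for illustration; the point is that both versions produce the same predictions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data.
texts = ["win a free prize now", "free cash reward",
         "see you at lunch", "are you home yet"]
labels = ["spam", "spam", "ham", "ham"]
new_texts = ["free prize", "lunch at home"]

# Manual version: every step is called by hand.
vec = TfidfVectorizer()
clf = MultinomialNB()
clf.fit(vec.fit_transform(texts), labels)
manual_pred = clf.predict(vec.transform(new_texts))

# Pipeline version: one fit and one predict run the same steps.
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
pipe.fit(texts, labels)
pipe_pred = pipe.predict(new_texts)

same_predictions = list(manual_pred) == list(pipe_pred)
print(same_predictions)
```

In the manual version it is easy to call fit_transform on new data by mistake; the pipeline applies transform only at prediction time, which removes that source of error.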

Pipeline usage example

The code below uses a pipeline to implement training, prediction, and performance evaluation. (Source: https://zephyrus1111.tistory.com/254)

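from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Note: X and y in this example are assumed to be a numeric feature matrix
# and its labels, not the text data loaded above.

## Register the steps
pipeline = Pipeline([('Feature_Selection', SelectKBest(f_classif, k=2)), ## feature selection
 ('Standardization', StandardScaler()),  ## standardization
  ('Decision_Tree', DecisionTreeClassifier(max_depth=3)) ## model to train
])

pipeline.fit(X, y) ## train the model
print(pipeline.predict(X)[:3]) ## predict
print(pipeline.score(X, y)) ## evaluate (accuracy on the same data used for training)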

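Choosing the preprocessing step and the classifier

Text that people can read is called natural language. TfidfVectorizer() is a preprocessing step that converts such text into numeric vectors weighted by TF-IDF, so that a computer can process it. It goes in the first step of the pipeline.

MultinomialNB, ComplementNB, and LinearSVC are classifiers, and each goes in the second step of the pipeline.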

pipeMNB = Pipeline([('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1,3))), ('clf', MultinomialNB())])
pipeCNB = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1,3))), ('clf', ComplementNB())])
pipeSVC = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1,3))), ('clf', LinearSVC())])

Training (fit) and prediction (predict)

Now train (fit) each pipeline built above, predict on the test set, and print the accuracy score.

pipeMNB.fit(X_train, y_train)
predictMNB = pipeMNB.predict(X_test)
print(f"MNB: {accuracy_score(y_test, predictMNB):.2f}")

pipeCNB.fit(X_train, y_train)
predictCNB = pipeCNB.predict(X_test)
print(f"CNB: {accuracy_score(y_test, predictCNB):.2f}")

pipeSVC.fit(X_train, y_train)
predictSVC = pipeSVC.predict(X_test)
print(f"SVC: {accuracy_score(y_test, predictSVC):.2f}")

MNB: 0.95
CNB: 0.98
SVC: 0.99

The scores differ slightly between classifiers. pipeSVC has the highest accuracy here, so it is chosen as the spam classifier.

Predicting on a new sentence returns spam, as shown below.
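
Accuracy alone can look high when one label dominates, so a per-class view may be a useful addition. classification_report is imported above but not used; the sketch below shows what it adds, using hypothetical hand-written labels (no model involved) in which every message is predicted as ham.

```python
from sklearn.metrics import accuracy_score, classification_report, recall_score

# Hypothetical labels: 9 ham, 1 spam; the "classifier" predicts ham every time.
y_true = ["ham"] * 9 + ["spam"]
y_pred = ["ham"] * 10

acc = accuracy_score(y_true, y_pred)
spam_recall = recall_score(y_true, y_pred, pos_label="spam")
print(f"accuracy: {acc:.2f}, spam recall: {spam_recall:.2f}")

# Per-class precision, recall, and F1 in one table.
# zero_division=0 avoids a warning for the class that is never predicted.
print(classification_report(y_true, y_pred, zero_division=0))
```

The same call could be applied to the real predictions, for example classification_report(y_test, predictSVC), to check how many spam messages each pipeline actually catches.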

msg = "you have won a $10000 prize! contact us for eh reward!"
clsf = pipeSVC.predict([msg])
print(clsf[0])

> spam
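
predict also accepts several messages at once, and a LinearSVC pipeline exposes decision_function, which indicates how far each message lies from the decision boundary. The following is a minimal sketch on hypothetical toy data (toy_pipe, texts, and new_msgs are made up for illustration and stand in for the trained pipeSVC and real messages).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training data, standing in for X_train and y_train.
texts = [
    "win a free prize now", "claim your cash reward",
    "free prize waiting, call now",
    "see you at lunch", "are you home yet", "thanks for the notes",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

toy_pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
toy_pipe.fit(texts, labels)

new_msgs = ["free cash prize", "lunch tomorrow?"]
preds = toy_pipe.predict(new_msgs)
# Signed distance to the separating hyperplane; positive means classes_[1].
scores = toy_pipe.decision_function(new_msgs)

print(toy_pipe.classes_)
for m, p, s in zip(new_msgs, preds, scores):
    print(f"{m!r}: {p} (score {s:+.2f})")
```

A score close to zero suggests the classifier is less sure about that message, which can be a hint to review it manually.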

References:

https://yumdata.tistory.com/383

https://zephyrus1111.tistory.com/254

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

https://www.youtube.com/watch?v=eOu-h_XxjHQ&t=1713s
