
sklearn - Spam Message Classification (spam-text-message-classification)

by Janger 2024. 4. 8.

 

notebook

 

spam-text-message-classification.ipynb
0.01MB

 

Spam Text dataset

 

https://www.kaggle.com/datasets/team-ai/spam-text-message-classification

 


 

 

Write-up

 

 

Import the libraries needed for data handling

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

 

Load the SPAM Text dataset and preview the data

df = pd.read_csv("/kaggle/input/spam-text-message-classification/SPAM text message 20170820 - Data.csv")
df.head()

 

The resulting table has two columns, Category and Message. A message is labeled spam in Category if it is spam, and ham otherwise.
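Before modeling, it can help to check how many messages fall into each category, since spam datasets are often imbalanced. The following is a minimal sketch, assuming a tiny hand-made DataFrame (`toy` and its rows are invented for illustration and are not taken from the Kaggle file) with the same two column names:

```python
import pandas as pd

# Hypothetical toy rows using the same column names as the Kaggle CSV.
toy = pd.DataFrame({
    "Category": ["ham", "ham", "spam", "ham"],
    "Message": [
        "See you at lunch",
        "Call me when you get home",
        "WIN a free prize now",
        "Ok, thanks",
    ],
})

# Number of messages per label.
counts = toy["Category"].value_counts()
# Share of each label (fractions that sum to 1).
shares = toy["Category"].value_counts(normalize=True)

print(counts)
print(shares)
```

On the real data, the same two calls on df['Category'] should show the ham/spam balance.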

 

Assign the messages to the variable X and the categories to the variable y.

X = df['Message']
y = df['Category']
len(X)

 

 

train_test_split(): a data-splitting function that plays an important role in training and evaluating machine learning models

 

It returns the training data (X_train, y_train) and the test data (X_test, y_test).

The split is random: the rows are shuffled before being divided. Passing a fixed random_state, as in the call below, makes the same split come out on every run.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)
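The effect of random_state can be checked directly. The sketch below uses made-up toy data (`X_toy`, `y_toy` are hypothetical) and also shows the optional `stratify` argument, which the post itself does not use, as one way to keep the ham/spam ratio equal in both parts:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 messages, 2 of them spam.
X_toy = [f"message {i}" for i in range(10)]
y_toy = ["spam" if i < 2 else "ham" for i in range(10)]

# The same random_state gives an identical split on every call.
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.5, random_state=22)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.5, random_state=22)
print(a_test == b_test)  # True

# stratify keeps the class ratio the same in the train and test parts.
_, _, s_train, s_test = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=22, stratify=y_toy
)
print(s_train.count("spam"), s_test.count("spam"))
```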

 

 

Next, import the libraries used for machine learning

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

 

 

Pipeline: chains the steps from data preprocessing through model training into a single object

https://medium.com/geekculture/how-to-use-sklearn-pipelines-to-simplify-machine-learning-workflow-bde1cebb9fa2

 

Pipeline structure

 

Pipeline takes a list of two-element tuples as its argument. Each tuple pairs a step name with the object that performs that step.

Pipeline( [ ('step_name_1', step_object_1), ('step_name_2', step_object_2) ] )

 

 

Training is possible without a pipeline, but each stage (feature selection, standardization, model training) then has to be coded one by one, which gets complicated. With a pipeline, the whole sequence runs in a single call.
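To make the difference concrete, here is a small sketch that runs the same two steps by hand and then through a Pipeline. The training texts and labels are invented toy data, and the step names 'tfidf' and 'clf' are arbitrary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data for illustration only.
train_texts = ["win a free prize now", "free cash prize", "see you at lunch", "call me later"]
train_labels = ["spam", "spam", "ham", "ham"]
new_texts = ["free prize", "lunch later"]

# Manual version: each step is fitted and applied by hand.
vec = TfidfVectorizer()
clf = MultinomialNB()
clf.fit(vec.fit_transform(train_texts), train_labels)
manual_pred = clf.predict(vec.transform(new_texts))

# Pipeline version: one fit call and one predict call run the same steps.
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(new_texts)

print(list(manual_pred), list(pipe_pred))
```

Both versions should give the same predictions. The pipeline also applies transform (not fit_transform) to new data automatically, which is easy to get wrong by hand.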

 

Pipeline usage example

 

์•„๋ž˜๋Š” ํŒŒ์ดํ”„๋ผ์ธ์œผ๋กœ ํ•™์Šต ๋ถ€ํ„ฐ ์˜ˆ์ธก, ์„ฑ๋Šฅ ํ‰๊ฐ€๋ฅผ ๊ตฌํ˜„ํ•œ ์ฝ”๋“œ์ด๋‹ค. (์ถœ์ฒ˜ : https://zephyrus1111.tistory.com/254)

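## Imports required by this example
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

## Register the steps
## Note: X, y here must be a numeric feature matrix and its labels, not the text messages above
pipeline = Pipeline([('Feature_Selection', SelectKBest(f_classif, k=2)), ## feature selection
 ('Standardization', StandardScaler()),  ## standardization
  ('Decision_Tree', DecisionTreeClassifier(max_depth=3)) ## model to train
])

pipeline.fit(X, y) ## train the model
print(pipeline.predict(X)[:3]) ## predict
print(pipeline.score(X, y)) ## evaluate performance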

 

 

Choosing the preprocessing step and the classifiers

 

 

Text that people can read is called natural language. TfidfVectorizer() is a preprocessing class that turns each sentence into a numeric vector of TF-IDF weights so that the computer can process it easily. In the pipeline structure it goes in the first step.

 

MultinomialNB, ComplementNB, and LinearSVC are classes used for classification problems. In the pipeline structure they go in the second step.

 

pipeMNB = Pipeline([('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1,3))), ('clf', MultinomialNB())])
pipeCNB = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1,3))), ('clf', ComplementNB())])
pipeSVC = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1,3))), ('clf', LinearSVC())])

 

 

 

Training (fit) and prediction (predict)

 

Now train (fit) and predict (predict) with each pipeline and print the accuracy score.

pipeMNB.fit(X_train, y_train)
predictMNB = pipeMNB.predict(X_test)
print(f"MNB: {accuracy_score(y_test, predictMNB):.2f}")

pipeCNB.fit(X_train, y_train)
predictCNB = pipeCNB.predict(X_test)
print(f"CNB: {accuracy_score(y_test, predictCNB):.2f}")

pipeSVC.fit(X_train, y_train)
predictSVC = pipeSVC.predict(X_test)
print(f"SVC: {accuracy_score(y_test, predictSVC):.2f}")

 

MNB: 0.95
CNB: 0.98
SVC: 0.99

 

The scores differ slightly between classifiers. pipeSVC has the highest accuracy here, so it is chosen as the spam classifier.
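Accuracy alone can hide missed spam when most messages are ham. classification_report is imported above but not used in this post, so here is a sketch of how it could complement the accuracy score. The labels below are invented for illustration:

```python
from sklearn.metrics import accuracy_score, classification_report, recall_score

# Hypothetical labels: 8 ham and 2 spam, with one spam message missed.
y_true = ["ham"] * 8 + ["spam"] * 2
y_pred = ["ham"] * 8 + ["ham", "spam"]

# 9 of 10 predictions are correct.
acc = accuracy_score(y_true, y_pred)
# Only 1 of the 2 spam messages is caught.
spam_recall = recall_score(y_true, y_pred, pos_label="spam")

print(f"accuracy: {acc:.2f}, spam recall: {spam_recall:.2f}")
print(classification_report(y_true, y_pred))
```

On this post's data, classification_report(y_test, predictSVC) would give the same per-class breakdown for the chosen model.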

 

 

Predicting on a new sentence confirms that it is classified as spam.

msg = "you have won a $10000 prize! contact us for eh reward!"
clsf = pipeSVC.predict([msg])
print(clsf[0])

> spam
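predict accepts a list, so several messages can be classified in one call. With a LinearSVC as the final step, decision_function also returns a signed score for each message, which gives a rough sense of how far it lies from the decision boundary. A minimal sketch on invented toy data (not the Kaggle data, so the scores themselves are only illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training data for illustration only.
texts = ["win a free prize now", "free cash reward", "see you at lunch", "call me later"]
labels = ["spam", "spam", "ham", "ham"]

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
pipe.fit(texts, labels)

msgs = ["free prize reward", "lunch later"]
preds = pipe.predict(msgs)             # one label per message
scores = pipe.decision_function(msgs)  # signed distance to the boundary

# Positive scores map to pipe.classes_[1], negative scores to pipe.classes_[0].
for m, p, s in zip(msgs, preds, scores):
    print(f"{p:>4} {s:+.2f} {m}")
```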

 

 

 

References:

https://yumdata.tistory.com/383

https://zephyrus1111.tistory.com/254

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

 

https://www.youtube.com/watch?v=eOu-h_XxjHQ&t=1713s

 

 
