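This is the second installment of an introductory machine learning series whose motto is "learn from zero knowledge." Through basic, practical usage examples of Python libraries, you will work hands-on through the whole flow: loading and processing data (pandas), numerical computation (NumPy), data visualization (Matplotlib/seaborn), and machine learning (scikit-learn).
As of August 19, 2024, all the code in this article has been verified to run correctly on the latest Colab environment.
The previous installment explained the basics of machine learning and gave an overview of the main Python libraries.
This time, we learn the basic flow of machine learning programming in Python hands-on, by actually writing code. Concretely, you can experience the basic sequence end to end: loading and processing data, visualizing it with graphs, computing statistics numerically, and building a simple machine learning model (Figure 1).
As Figure 1 shows, following the basic flow of machine learning programming means using the main Python libraries introduced in Part 1 (pandas, NumPy, Matplotlib, seaborn, scikit-learn, and so on), each at its own stage.
To understand each library deeply and use it fluently, you need to study it individually in detail. This series does not go into those details and focuses on basic usage examples that are useful in practice. If you want to dig deeper, the separate series "Introduction to Python Data Processing" is recommended.
In short, this installment covers the content shown in Figure 2.
Let's begin by introducing the data used this time.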
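Do you think machine learning looks difficult? There is no need to worry. With the motto "learn from zero knowledge," this series explains the basics of machine learning and each technique in an easy-to-follow way, using diagrams and concise explanations. It also includes hands-on exercises in Python, so you can build practical skills by working through the code yourself.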
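This time we use a flower dataset (dataset: a collection of data) called Iris (source: https://doi.org/10.24432/C56C76, license: CC BY 4.0).
The Iris dataset has ideal characteristics for learning the basic flow of machine learning. It is simple, with only four explanatory variables, and its 150 records are enough without being too few. It was chosen because it is small enough to grasp as a whole, which makes it a good fit for beginners.
It is used so often in beginner tutorials that you may think "not this again," but the data has been slightly processed and rearranged here so that you can experience the basic flow of machine learning. We hope you will approach it with fresh eyes.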
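The explanatory variables (features) of the Iris dataset are the following four items: sepal length, sepal width, petal length, and petal width. Sepals and petals are both parts that make up the flower. The sepals, which protect the flower, are on the very outside, and the petals are inside them (Figure 1). In irises, the sepals are a particularly prominent part of the flower.
The target variable (target, label) of the Iris dataset is the iris species (class). Concretely, there are three species: setosa, versicolor, and virginica. Each has 50 records, for 150 in total.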
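The goal of this machine learning exercise is to predict the iris species from the input data (the four features). In other words, it is a classification task.
Let's load this dataset with Python.
As explained in Part 1, this series assumes the free cloud environment Google Colab. Basically, create a new notebook in Colab, type in the code explained below, and check the execution results with your own eyes. If you would rather use a notebook with the code already entered, use the sample notebook.
To load data from CSV or Excel files in Python, the pandas library is very convenient (see the reference article). Data can be loaded with a single function call, so it is widely used in machine learning practice. The Python libraries used for machine learning were explained in Part 1, so those explanations are omitted here.
pandas provides the read_csv() function for CSV files and the read_excel() function for Excel files. Each takes a file system path or an Internet URL as its first argument. So, to load the Iris dataset from the Internet by specifying a URL, write the code as in Listing 1. (So that you can actually try the preprocessing described later, a processed version of the data is distributed from the author's GitHub repository.)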
import pandas as pd
# Load the data
url = 'https://raw.githubusercontent.com/isshiki/machine-learning-with-python/main/02-scikit-learn/iris_processed.csv'
df = pd.read_csv(url)
# Check the data
df.head()
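By convention, pandas is imported under the alias pd. The data loaded with pd.read_csv(url) is assigned to the variable df as a DataFrame, an object that holds two-dimensional (tabular) data.
To check that the data loaded correctly, we call the head() method of the df object. This displays the first five rows of the loaded data (Figure 4).
Printing every row would make the output long, so only the first five rows are shown. This is one of the techniques used most often in machine learning work.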
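The data is ready, so let's start machine learning... is usually not how it goes. Using data without checking it is dangerous. With manually entered data in particular, some values are often mistyped or missing.
For example, the data may contain anomalous values (or outliers that differ greatly from the other values), such as 0.001 entered as 1.0, or it may contain blank cells, that is, missing values. In the table in Figure 4 above, the cell in the third row, second column shows "NaN" (Not a Number), which means a missing value.
The [Class] column in Figure 4 also shows the string "setosa". scikit-learn can only handle numbers, so categorical values (strings) must be replaced with numbers (Python int or float) beforehand.
Getting data into a form suited to data analysis and machine learning in advance is called preprocessing. Representative preprocessing tasks include the ones just mentioned: handling anomalous values, handling missing values, and converting categorical values to numbers.
In the machine learning world, it is said that most of the time (by some accounts, 80%) is spent on preprocessing (and on the feature engineering described later). Taking this time and doing the work carefully, however, can substantially improve the performance of a machine learning model.
Although the order differs from the explanation above, let's carry out preprocessing in this order: converting categorical values to numbers, handling missing values, and handling anomalous values.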
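Feature engineering is work that partly overlaps with preprocessing. Whereas preprocessing focuses on cleaning the data, feature engineering focuses on producing new features (explanatory variables) in order to improve the performance of a machine learning model. Concretely, it mainly involves creating new features and selecting features.
To do feature engineering, you first need to understand the data deeply. That is why we visualize and analyze the data with various Python libraries.
You also need knowledge of the field the data belongs to (called domain knowledge). For baseball data, for example, knowing what a "hit" or a "home run" is would be indispensable, because that knowledge is what lets you derive meaningful features from the data.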
Now let's replace the string categorical values in the [Class] column of the DataFrame (the df object) with int values. Write the code as in Listing 2.
# Map the categorical values to numbers
class_mapping = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
df['Class_ID'] = df['Class'].map(class_mapping)
# Check the data
df.head()
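With this code, setosa, versicolor, and virginica are replaced with the numbers 0, 1, and 2 respectively and added to the DataFrame as a new [Class_ID] column. The original [Class] column is kept so that the values before conversion can still be referenced. Figure 5 shows the result.
In machine learning, categorical values are usually replaced with sequential numbers starting from 0 (or 1); this technique is called label encoding. For categories whose order is meaningful (for example, a five-level satisfaction rating), numbers are assigned according to that order; this technique is called ordinal encoding. There are other techniques as well, such as one-hot encoding, but they are not covered in this article.
The map() method is convenient for converting categorical values to sequential numbers efficiently. It is a method of the pandas Series (in this example, the one-dimensional data object for the [Class] column obtained with df['Class']). It takes the source-to-destination combinations (the mapping) as a dict object (class_mapping in this example) and converts the values accordingly.
Column data that contains categorical values is also called a categorical variable or categorical data.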
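The ordinal encoding mentioned above can be sketched in the same way as Listing 2. The survey data, the column names, and the three-level scale below are hypothetical and are not part of the Iris dataset; this is only a minimal illustration of assigning numbers that follow the order of the categories.

```python
import pandas as pd

# Hypothetical survey data (an assumption for illustration only)
df_survey = pd.DataFrame({'satisfaction': ['low', 'high', 'medium', 'low']})

# Ordinal encoding: the numbers follow the natural order of the categories
order_mapping = {'low': 0, 'medium': 1, 'high': 2}
df_survey['satisfaction_id'] = df_survey['satisfaction'].map(order_mapping)

# map() yields NaN for any value missing from the dict, so count them
unmapped = int(df_survey['satisfaction_id'].isna().sum())
print(df_survey)
print('Unmapped values:', unmapped)
```

Checking for NaN after map() is a cheap way to catch misspelled or unexpected category names.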
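Let's check whether each column of the pandas DataFrame contains missing values. The code is shown in Listing 3.
# Count the missing values in each column
df.isna().sum()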
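The isna() method returns a new DataFrame in which elements that hold a missing value such as NaN are set to True and all other elements are set to False.
The sum() method returns the total for each column of the DataFrame.
Figure 6 shows the result. When bool values are treated as numbers, True is 1 and False is 0, so each column's total is simply the number of missing values in that column.
We can confirm that the [Sepal Width] column has one missing value. This is the NaN seen in Figure 4 above.
The main ways to deal with missing values are to delete the affected rows (or columns) or to fill in (impute) the missing values.
In this example there is only one missing value, and deleting a single row should have little effect on the machine learning result, so we delete the row that contains the missing value (Listing 4).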
# Drop rows that contain missing values
df_dropped = df.dropna()
# Print the result
print('Rows in original data: ', len(df))
print('Rows after dropping missing values: ', len(df_dropped))
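A note on using the dropna() method: by default it deletes rows that contain missing values, but specifying the argument axis=1 deletes columns instead. If a particular column contains many missing values (for example, more than half of its data is missing), consider deleting that column.
Passing a DataFrame (df_dropped in this example) to Python's len() function gives its number of rows. Figure 7 shows the result, confirming that one row has indeed been removed.
The same thing as df.dropna() in Listing 4 can be written in other ways. For example, you can write df[~df.isna().any(axis=1)] (an explanation of this code is omitted). Keep in mind that the code from here on is likewise only one way of writing it.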
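As a small check of the two points above, the following sketch uses a made-up three-row DataFrame (not the Iris data) to confirm that the boolean-mask form gives the same rows as dropna(), and to show mean imputation with fillna() as one possible alternative to deletion. The column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value (hypothetical, for illustration only)
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, 6.0]})

# Two ways to drop rows that contain a missing value
dropped = toy.dropna()
masked = toy[~toy.isna().any(axis=1)]
same = dropped.equals(masked)

# Alternative to deletion: fill each missing value with its column mean
filled = toy.fillna(toy.mean())

print('Same result:', same)
print(filled)
```

Whether deletion or imputation is appropriate depends on how much data is missing and why; the mean is only one of several possible fill values.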
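Next, let's check whether each column of the pandas DataFrame contains anomalous values (or outliers). Anomalous values become obvious at a glance once the data is visualized as a graph.
Graphs that help detect anomalous values include, for example, box plots and scatter plots. For details on each kind of plot, see the linked pages.
This time, let's draw the most basic one, a box plot (Listing 5).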
import matplotlib.pyplot as plt
# Select the four features (explanatory variables)
df_features = df_dropped.loc[:, ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width']]
# Display a box plot
df_features.boxplot()
plt.show()
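To draw basic graphs in Python, use the Matplotlib library. Its plotting module, matplotlib.pyplot, is conventionally imported under the alias plt.
Here we create a box plot for the four features, so we first extract the features from the DataFrame. For this we use the loc indexer, which can extract part of a table by row and column.
In Listing 5, loc is given : to select all rows and ["Sepal Length", "Sepal Width", "Petal Length", "Petal Width"] to select the four features. A new DataFrame containing those four features is assigned to the variable df_features.
Calling the boxplot() method of the DataFrame creates a box plot using Matplotlib.
To display the created graph in the notebook, call the plt.show() function. Figure 8 shows the result.
The [Sepal Width] column shows small circles, so there appear to be outliers (by the box plot's criterion), but they are not drawn extremely far away and are probably within an acceptable range. We treat the data as having no anomalous values and use all of it as is.
When there are anomalous values, decide according to the situation whether to delete or impute them, as with the missing values earlier. Basically, deleting the rows that contain anomalous values is recommended. If an outlier is not an error but is thought to reflect reality, it is appropriate to use it as is.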
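It is safer to check carefully with numerical statistics as well as with graphs. Listing 6 is an example.
# Show basic statistics for the selected features
df_features.describe()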
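Calling the describe() method of the DataFrame that contains the four features displays the statistics together, as in Figure 9. Handy.
Looking at the statistics lets you grasp the overall picture of the data and infer whether anomalous values exist. In particular, if the minimum (min) is far from the first quartile (25%), or the maximum (max) is far from the third quartile (75%), an anomalous value may be present.
In addition, if the mean and the median (50%) differ greatly, or the standard deviation (std) is abnormally large, meaning the data is widely scattered, the distribution may contain anomalous values.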
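The quartile-based check described above can be made concrete with the 1.5 × IQR rule, which is the default criterion Matplotlib's box plot uses to draw points as outliers. The sketch below applies it to a small made-up Series (not the Iris data); the values and the factor 1.5 are assumptions chosen for illustration.

```python
import pandas as pd

# Toy values with one obviously distant point (hypothetical data)
values = pd.Series([2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 9.0])

q1 = values.quantile(0.25)   # first quartile
q3 = values.quantile(0.75)   # third quartile
iqr = q3 - q1                # interquartile range

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(outliers.tolist())
```

A flagged point is only a candidate: as noted earlier, whether it is truly "anomalous" still has to be judged from the context of the data.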
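Box plots and scatter plots, introduced above as graphs that help detect anomalous values, can be created with Matplotlib. With the seaborn library you can create an even wider variety of graphs. One example is the bee swarm plot.
Let's create a bee swarm plot here. First, run the code in Listing 7 to install the seaborn library.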
! pip install seaborn
Next, draw the bee swarm plot (Listing 8). seaborn is built on Matplotlib, so importing the matplotlib.pyplot module is (basically) still required.
# Import the seaborn library
import seaborn as sns
import matplotlib.pyplot as plt
# Create a bee swarm plot with seaborn
sns.swarmplot(data=pd.melt(df_features), x="variable", y="value", size=2.5)
plt.show()
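The seaborn module of the seaborn library is conventionally imported under the alias sns. Incidentally, "sns" is said to come from Samuel Norman Seaborn, a character in a TV drama.
To create a bee swarm plot, call the sns.swarmplot() function. The data argument is given pd.melt(df_features) because swarmplot() is used here with tabular data (data organized in rows and columns) in long form.
Long-form organizes tabular data in a "tall" layout, and wide-form organizes it in a "wide" layout (see the reference article). The Iris dataset, for example, is wide-form tabular data: each row is one data point, and each feature (sepal length, sepal width, petal length, petal width) is a column.
Converting to long form gathers the four feature columns into a single "feature type" column and represents the values in a single "value" column (Figure 10).
This wide-to-long conversion takes just one call to the pd.melt() function. By default, the "feature type" column is named variable and the "value" column is named value. The arguments x="variable", y="value" of sns.swarmplot() specify these column names.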
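To see what the wide-to-long conversion produces, the following sketch runs pd.melt() on a tiny two-column, two-row table. The values are made up for illustration; only the column names are borrowed from the Iris features.

```python
import pandas as pd

# Wide form: one column per feature (toy values for illustration)
wide = pd.DataFrame({'Sepal Length': [5.1, 4.9], 'Sepal Width': [3.5, 3.0]})

# Long form: one 'variable' column (feature name) and one 'value' column
long_df = pd.melt(wide)

print(long_df)
print(list(long_df.columns))  # default column names after melt
```

Each of the original cells becomes one row, so a table with 2 rows and 2 feature columns turns into 4 rows in long form.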
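The argument size=2.5 specifies the size of the plotted points (data points). A bee swarm plot is designed to display points without overlapping them. When there is a lot of data, the points cannot all be placed, and a warning is displayed when the function is called. The points are made smaller to avoid this.
After that, as before, plt.show() displays the plot in the notebook (Figure 11).
This bee swarm plot shows no data points that are far from the rest, confirming that there are no apparent anomalous values.
seaborn's pairplot function is also useful, so it is introduced here. It is simple enough that no explanation should be needed (Listing 9). Incidentally, pandas also has a scatter_matrix() function that does something similar.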
sns.pairplot(df_features)
plt.show()
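This function pairs up the features in the DataFrame and creates many graphs at once. With four features, 16 plots (a 4×4 matrix) are created (Figure 12).
You can see the relationships between features and the distribution of each feature at a glance. The diagonal cells (for example, row 1, column 1, which pairs [Sepal Length] with itself) show the distribution of each feature as a histogram, and the other cells (for example, row 1, column 2, which pairs [Sepal Length] with [Sepal Width]) show the relationship between two features as a scatter plot.
Using pairplot in this way lets you quickly grasp an overview of the dataset, including correlations between features, the shapes of the distributions, and the presence or absence of anomalous values.
In machine learning projects, a deep understanding of the dataset (whose quality has been improved by preprocessing) is very important for selecting features and for tuning hyperparameters (settings specified by a human before training). This work is called exploratory data analysis (EDA). Visualization like the above helps with understanding at its early stage.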
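In Listing 6 above, we computed statistics with pandas. When you want fast numerical computation, especially on large amounts of data, NumPy is often the better choice than pandas and is recommended.
The array data of the numerical library NumPy, called ndarray, offers faster access and numerical computation than the pandas DataFrame. This is because every value in an ndarray has the same homogeneous data type and because low-level optimizations are applied.
When the author actually compared speeds on Colab, with two-dimensional data (NumPy ndarray vs. pandas DataFrame) NumPy was about 200 times faster than pandas at accessing an array element and about 30 times faster at computing a mean. With one-dimensional data (NumPy ndarray vs. pandas Series), NumPy was about 1000 times faster at accessing an array element, about 8 times faster at computing a mean, and about 100 times faster at multiplication (you can try this in the sample notebook). Note, however, that the results vary with the execution environment and the data.
pandas, on the other hand, is convenient for the preprocessing described so far and for the early stages of analysis. It is best to use pandas and NumPy each where it fits the purpose.
Now let's convert the data preprocessed with pandas into an ndarray and compute the mean (Listing 10).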
import numpy as np
# Convert the pandas DataFrame to an ndarray (NumPy array)
features_array = df_features.to_numpy(dtype='float32')
# Compute the mean of each column with NumPy
mean_features = np.mean(features_array, axis=0)
mean_features # Example output: array([5.8510065, 3.0563755, 3.7744968, 1.2060403], dtype=float32)
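The numpy module is conventionally imported under the alias np.
The pandas to_numpy() method converts a DataFrame to a NumPy ndarray. Unlike a DataFrame, an ndarray must have a single homogeneous data type, so dtype='float32' is specified to unify the data type explicitly.
Next, we call np.mean(), one of NumPy's numerical functions. The result is the same as the mean row in Figure 9, which was computed with pandas.
In machine learning, normalization, which unifies the scale (units) of the features, can be expected to stabilize learning and improve the performance of the model. One representative normalization technique is standardization.
Standardization rescales each feature so that its mean (the center of the distribution) becomes 0 and its standard deviation (the spread of the data around the mean) becomes 1. The calculation just subtracts the mean from the data and divides by the standard deviation (for example, (data - mean) / std), so it is easy to implement with NumPy or pandas.
With scikit-learn it is even easier. The sklearn.preprocessing module provides classes for various kinds of normalization. For standardization, use the StandardScaler class (Listing 11).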
from sklearn.preprocessing import StandardScaler
# Standardize all the features with scikit-learn
scaler = StandardScaler()
sk_scaled = scaler.fit_transform(features_array)
sk_scaled[:2] # Show the first two rows
# Example output:
# array([[-0.9128386, 1.0181674, -1.353994 , -1.3275825],
# [-1.1559356, -0.1293889, -1.353994 , -1.3275825]], dtype=float32)
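The fit_transform() method performs the data transformation (standardization in this case). Standardization is generally applied to all the features of the dataset (features_array in this example) at once.
One point to note is that scikit-learn, NumPy, and pandas can differ in how they compute the standard deviation. StandardScaler in scikit-learn uses the population standard deviation (ddof=0). NumPy's std() also defaults to ddof=0, whereas pandas' std() defaults to ddof=1, the unbiased standard deviation (an estimate of the population standard deviation from a sample). In both NumPy and pandas you can select either one with the ddof argument, and specifying ddof=0 gives the same standard deviation as scikit-learn.
For example, with NumPy or pandas you can standardize with the code (data - data.mean(axis=0)) / data.std(axis=0, ddof=0). axis=0 means the calculation is done per column, that is, per feature (see the NumPy and pandas help). Because of rounding errors in the decimal places and similar effects, the results from scikit-learn and from NumPy or pandas may not match exactly.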
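The ddof point above can be checked on a small array. The following sketch uses made-up numbers (not the Iris features) and float64 data; it compares the manual formula with StandardScaler and shows that pandas' default standard deviation differs from the ddof=0 value.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data: 3 samples x 2 features (hypothetical values)
data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

# Manual standardization with the population standard deviation (ddof=0)
manual = (data - data.mean(axis=0)) / data.std(axis=0, ddof=0)

# The same transformation with scikit-learn
sk = StandardScaler().fit_transform(data)
matches = np.allclose(manual, sk)

# pandas' std() defaults to ddof=1, so it differs unless ddof=0 is given
std_pandas_default = pd.DataFrame(data).std().to_numpy()
std_pandas_ddof0 = pd.DataFrame(data).std(ddof=0).to_numpy()

print('manual == StandardScaler:', matches)
print('pandas default std:', std_pandas_default)
print('pandas ddof=0 std :', std_pandas_ddof0)
```

np.allclose() is used instead of exact equality because small floating-point differences are expected between implementations.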
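If you want to learn more about numerical transformation techniques like this, refer to the explanation under "Feature engineering" → [Numerical variables] in the linked article.
So far, we have gone from loading the data to preprocessing, such as handling missing and anomalous values, and have improved the quality of the data. Although they are more advanced topics, the columns also introduced exploratory data analysis, which deepens your understanding of the dataset through visualization and other means, and feature engineering, which creates and selects features. These are important tasks for improving the performance of a machine learning model. Next, we finally move on to the machine learning stage.
Before training a machine learning model, it is common to split the data into a training dataset (training set) and a test dataset (test set). This is so that you can accurately evaluate how well the model predicts on unknown data that it did not use during learning, that is, the model's generalization performance.
Preparing a test set that is separate and independent from the training process lets you grasp the true performance of the model on unknown real-world data (generalization performance). It also makes it possible to detect the problem of the model fitting the training data excessively (called overfitting).
This time, 90% of the data is assigned to the training set and 10% to the test set (Figure 13). There is no clear rule for this ratio, but because the data here is small at 149 records, the test set was kept small. Ratios of 80:20 or 70:30 seem to be common in general. A larger training set makes learning easier, and a larger test set makes it easier to evaluate generalization performance properly.
With the scikit-learn library, splitting the data is also easy. Specifically, the train_test_split() function in the sklearn.model_selection module splits a dataset into training and test portions (Listing 12).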
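from sklearn.model_selection import train_test_split
X = df_features # Input data (X: features): the variables fed into the model
y = df_dropped['Class_ID'] # Ground-truth values (y: labels): the target variable to predict
# Split the data into a training set and a test set (the test set is 10% of the whole)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
# Join the training-set features and labels, then show the first five rows
pd.concat([X_train, y_train], axis=1).head()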
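The four features that serve as input data are assigned to the variable X, and the labels (target variable), which are the ground-truth values to compare against the model's predictions, are assigned to the variable y. The naming follows mathematical convention: uppercase X denotes matrix data (a DataFrame, which is two-dimensional array data), and lowercase y denotes vector data (a Series, which is one-dimensional array data).
In addition to X and y, train_test_split() is given test_size=0.1. This means that 10% (0.1) of the data becomes the test set.
As seen in Figure 3 above, the Iris dataset is ordered, with the first rows all being "setosa," for example, so the data is unevenly arranged. Shuffling the data is therefore essential. train_test_split() has the argument shuffle=True by default, so the data is shuffled automatically even if you do not specify it.
The argument random_state=42 is called a random seed, a number that serves as the basis for the random shuffling. Specifying the seed value makes the result reproducible. In other words, by using the same seed, other people who run the code get the same split.
The number 42 is often used as a random seed. In the novel "The Hitchhiker's Guide to the Galaxy" by Douglas Adams, 42 is the result a supercomputer computes over 7.5 million years as the "Answer to the Ultimate Question of Life, the Universe, and Everything." The number is often quoted in science and technology, and using it has become a kind of playful tradition for ensuring reproducibility.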
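The reproducibility that a fixed seed provides can be confirmed directly. The sketch below uses small made-up arrays (not the Iris data) and an arbitrary test_size of 0.2; it splits the same data twice with random_state=42 and checks that both splits are identical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each (hypothetical values)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Splitting twice with the same seed gives the same shuffled split
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

print('Identical splits:', same_split)
print('Test labels:', split_a[3])
```

If random_state is omitted, each run may shuffle differently, so results such as accuracy can change from run to run.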
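train_test_split() returns a list of four elements: the training-set features (X_train), the test-set features (X_test), the training-set labels (y_train), and the test-set labels (y_test).
Finally, pd.concat() joins the training-set features and labels into a single DataFrame. axis=1 means they are joined in the column direction.
To optimize a machine learning model and get the most out of it, appropriate hyperparameter tuning is essential. Tuning requires splitting the data into three parts: a training set, a validation set (validation dataset), and a test set.
If the test set is used for hyperparameter tuning, the data is no longer "unknown" at test time, so generalization performance can no longer be evaluated properly. The test set is therefore left untouched and "unknown," and a separate validation set has to be split off for use during tuning.
Some machine learning tutorials do not prepare a validation set, presumably because they do not cover hyperparameter tuning. This series, too, will basically make do with a two-way split into training and test sets rather than a three-way split. In real projects, a three-way split that includes a validation set is recommended.
The validation set is generally made about the same size as the test set. Accordingly, this time the data is split into 80% training set and 10% each for the validation and test sets (Figure 14).
Since 10% has already been split off as the test set, the validation set is now split off from the training set. It is 10 percentage points out of the remaining 90%, so test_size=1/9 is passed to train_test_split() (Listing 13). The fraction 1/9 (10% ÷ 90%) is used because it conveys the intent more clearly than the number 0.111... would.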
# Split a validation set off from the training set (the validation set is 10% of the whole)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=1/9, random_state=42)
X_valid holds the validation-set features, and y_valid holds the validation-set labels. The data is now ready.
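To see how the 149 records end up divided by the two-step split, the following sketch repeats the same two train_test_split() calls on placeholder arrays of the same length. The all-zero arrays and the variable names are assumptions for illustration; only the sizes matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the cleaned dataset (149)
X_demo = np.zeros((149, 4))
y_demo = np.zeros(149)

# Step 1: hold out 10% as the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, random_state=42)
# Step 2: hold out 1/9 of the remainder as the validation set
X_tr, X_va, y_tr, y_va = train_test_split(X_tr, y_tr, test_size=1/9, random_state=42)

sizes = (len(X_tr), len(X_va), len(X_te))
print('train / valid / test:', sizes)
```

Because the test-set size is rounded up to a whole number of rows, the actual proportions are close to, but not exactly, 80/10/10.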
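Evaluating the performance of a machine learning model with a fixed split into three parts (training, validation, and test sets), or into two parts (training and validation sets), is called the hold-out method (hold-out validation).
The hold-out method is simple and easy to understand, but with a dataset as small as this one, each set becomes even smaller after splitting. Also, even though the data is shuffled, training and evaluation are done on one fixed split, so the possibility that learning or evaluation is biased cannot be ruled out.
Ideally, you would use all of the data for both evaluation and training and evaluate in a balanced way without bias. Cross-validation (CV) makes this possible. Even with cross-validation, however, you should split off and keep a test set in advance for the final evaluation.
Cross-validation is especially effective when data is scarce, but it has the drawback of extra computational cost. If you have enough data that is random and unbiased, the hold-out method is sufficient.
One kind of cross-validation is k-fold cross-validation. This technique divides the dataset into k folds, uses one fold as the validation set, and uses the remaining k−1 folds as the training set (Figure 15). The process is repeated k times so that each fold is used as the validation set once.
Figure 15 shows the case of k = 5, where training and validation (evaluation) are carried out with five machine learning models, labeled "Model 1" through "Model 5." With a decision tree, for example, five different trained models are created. Each model is evaluated on its validation set, and the average of the scores gives an overall evaluation of model performance.
Cross-validation is useful for finding the best hyperparameter values (or combination of values). scikit-learn provides the GridSearchCV class in the sklearn.model_selection module for this purpose (see the reference article). One common practice is to take the best hyperparameter values found this way and retrain a single machine learning model on the whole of the training and validation data.
Another common practice is to use all k models created by k-fold cross-validation, obtain k predictions for the test set, and combine them into a single prediction, for example by averaging (this is called ensemble learning). This approach is often adopted in Kaggle competitions, which are machine learning contests. It is tried out in a separate article, which you can use as a reference.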
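A minimal sketch of 5-fold cross-validation with scikit-learn follows. For self-containment it uses the unprocessed Iris data bundled with scikit-learn (load_iris) rather than the author's processed CSV, and it uses the Gaussian naive Bayes classifier as the model; both choices are assumptions made for this illustration. For classifiers, cross_val_score with an integer cv uses stratified folds, so each fold keeps roughly the same class proportions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Bundled Iris data (an assumption; not the processed CSV used in the listings)
X_iris, y_iris = load_iris(return_X_y=True)

# Train and evaluate 5 models, each validated on a different fold
scores = cross_val_score(GaussianNB(), X_iris, y_iris, cv=5)
mean_score = scores.mean()

print('Accuracy per fold:', scores)
print('Mean accuracy:', mean_score)
```

In a real project the test set would be split off first, and cross-validation would be run only on the remaining data.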
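Now for the final step. This time we introduce the flow that is common to machine learning in general. The details of individual machine learning algorithms will be explained one by one from the next installment onward.
In a machine learning project, you select a machine learning algorithm (for example, a decision tree), train (fit) a machine learning model on the training set, and make predictions (predict) with the trained model on the test set. Let's experience this basic flow with the scikit-learn library. In real projects, hyperparameters are also tuned with the validation set after training, but this series basically omits that step to keep the explanation simple.
As the machine learning algorithm this time, we use the naive Bayes classifier, well known for its use in sorting spam email. In scikit-learn, this is implemented as the GaussianNB class in the sklearn.naive_bayes module.
First, select the machine learning algorithm (Listing 14).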
from sklearn.naive_bayes import GaussianNB
# Select the machine learning algorithm
model = GaussianNB(var_smoothing=1e-9) # Naive Bayes classifier
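The var_smoothing parameter in the arguments is a hyperparameter that adjusts the variances in the naive Bayes classifier by adding a portion of the largest feature variance to all of them. It is used to prevent the overfitting that tends to occur when the training set is small and to improve the stability of the model. The default value is 1e-9 (that is, 0.000000001).
Next, feed the training set into the machine learning model (model) to train it (Listing 15).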
# Feed in the training set to train the machine learning model
model.fit(X_train, y_train)
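In scikit-learn, training is basically done with the model.fit() method. The training set passed to fit() is X_train, a pandas DataFrame, and y_train, a Series. Internally scikit-learn basically uses NumPy ndarrays, but it also supports pandas DataFrames and the like for input and output.
Next, feed the test set into the machine learning model (model) and make predictions with the trained model (Listing 16).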
# Feed in the test set and predict with the trained model
pred_test = model.predict(X_test)
pred_test
# Example output: array([1, 0, 2, 1, 2, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0], dtype=int64)
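In scikit-learn, prediction is basically done with the model.predict() method. The test set passed to predict() is X_test, a pandas DataFrame. As this shows, predictions can be made for multiple data points at once.
The predict() method returns a one-dimensional NumPy ndarray that contains one prediction for each record in the test set.
We compare these predictions (pred_test) with the ground-truth values of the test set (y_test) and evaluate the accuracy (Listing 17).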
from sklearn.metrics import accuracy_score
# Evaluate the predictions on the test set
print(f'Accuracy:{accuracy_score(y_test, pred_test)}')
# Example output: Accuracy:0.9333333333333333
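In scikit-learn, accuracy can be evaluated with the accuracy_score() function in the sklearn.metrics module. Besides accuracy, there are various other evaluation metrics, such as precision and the F1 score (see the reference article).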
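As a brief illustration of the other metrics just mentioned, the sketch below computes accuracy, precision, and the F1 score on a small set of made-up labels. The label arrays are hypothetical, and macro averaging (the unweighted mean over the three classes) is an assumption chosen for this example; other averaging options exist.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score

# Hypothetical ground-truth and predicted labels for three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

acc = accuracy_score(y_true, y_pred)
# For multiclass data an averaging method must be chosen; 'macro' is used here
prec = precision_score(y_true, y_pred, average='macro')
f1 = f1_score(y_true, y_pred, average='macro')

print(f'Accuracy: {acc:.3f}, Precision: {prec:.3f}, F1: {f1:.3f}')
```

Metrics other than accuracy become more important when the classes are imbalanced, which is not the case for the Iris dataset.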
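accuracy_score() returns a float value. 0.9333... means an accuracy of about 93%, which is very high.
If you want to tune hyperparameters before testing, change the value of the hyperparameter var_smoothing in Listing 14 and run the training in Listing 15. Then predict and evaluate in the same way as in Listings 16 and 17, but using the validation set (X_valid and y_valid). Repeating this procedure so that the accuracy becomes higher leads you to the best hyperparameter value.
Incidentally, predicting and evaluating with the validation set gave an accuracy of 100%, so there seems to be almost no room for hyperparameter tuning with the combination of the Iris dataset and the naive Bayes classifier.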
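The tuning procedure described above can be written as a simple loop. The sketch below is one possible way to do it, not the article's own code: it uses the unprocessed Iris data bundled with scikit-learn instead of the processed CSV, and the list of candidate var_smoothing values is an arbitrary assumption.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Bundled Iris data (an assumption; the listings use a processed CSV instead)
X_iris, y_iris = load_iris(return_X_y=True)
X_rest, X_te, y_rest, y_te = train_test_split(X_iris, y_iris, test_size=0.1, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_rest, y_rest, test_size=1/9, random_state=42)

# Candidate hyperparameter values (hypothetical choices for illustration)
candidates = [1e-9, 1e-6, 1e-3, 1e-1]
results = {}
for vs in candidates:
    candidate_model = GaussianNB(var_smoothing=vs)
    candidate_model.fit(X_tr, y_tr)                     # train on the training set
    pred_valid = candidate_model.predict(X_va)          # predict on the validation set
    results[vs] = accuracy_score(y_va, pred_valid)      # evaluate on the validation set

# Pick the value with the highest validation accuracy
best_vs = max(results, key=results.get)
print(results)
print('Best var_smoothing:', best_vs)
```

The test set (X_te, y_te) is deliberately left untouched in the loop so that it stays "unknown" for the final evaluation.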
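This installment explained the basic flow of machine learning with Python. Ideally, you should become able to write the code for this content without looking anything up. If you are not confident, try going through it once more from the beginning. If you have time to spare, also try putting it into practice next.
If you want to learn a more practical machine learning procedure, the separate series on getting started with machine learning by entering a Kaggle competition is recommended.
From the next installment, we will explain specific machine learning techniques (for example, linear regression, decision trees, and k-means). The content common to programming each of them has been explained this time, so such overlapping content will basically be omitted from now on.
Next time, we explain linear regression and program it in Python. Stay tuned.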
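In machine learning, you first load the data using read_csv() from the Python library pandas and obtain it as a DataFrame, an object that holds two-dimensional (tabular) data.
The loaded data needs preprocessing, which gets the data into a form suited to data analysis and machine learning in advance: handling anomalous and missing values, converting categorical values to numbers, and so on.
Anomalous values and outliers become obvious at a glance once the data is visualized as a graph. To draw basic graphs in Python, use the Matplotlib library.
In machine learning projects, it is important to understand the dataset deeply. This work is called exploratory data analysis (EDA). It also helps with feature engineering, in which features are selected or created.
For more advanced graphs, the seaborn library is useful. It includes functions such as pairplot(), which creates many graphs at once showing the relationships between features and the distribution of each feature.
For fast numerical computation, NumPy's array data, called ndarray, is well suited. It is recommended to use each library according to the purpose.
Split the dataset into three parts: a training set used for training, a validation set used for hyperparameter tuning, and a test set used to evaluate generalization performance.
In the basic flow of machine learning, you select a machine learning algorithm, train a model, and evaluate its performance by comparing the trained model's predictions with the ground-truth values.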
Copyright© Digital Advantage Corp. All Rights Reserved.