Save Money on Travel with AI: Accurately Predicting Homestay Listing Prices (Part 5)


# Abbreviate the bathroom descriptions (case-insensitive)
gm_regression_df['bathrooms_text'] = gm_regression_df['bathrooms_text'].str.replace("private bath", "pb", case=False)
gm_regression_df['bathrooms_text'] = gm_regression_df['bathrooms_text'].str.replace("private baths", "pbs", case=False)
gm_regression_df['bathrooms_text'] = gm_regression_df['bathrooms_text'].str.replace("shared bath", "sb", case=False)
gm_regression_df['bathrooms_text'] = gm_regression_df['bathrooms_text'].str.replace("shared baths", "sb", case=False)
gm_regression_df['bathrooms_text'] = gm_regression_df['bathrooms_text'].str.replace("shared half-bath", "sb", case=False)
gm_regression_df['bathrooms_text'] = gm_regression_df['bathrooms_text'].str.replace("private half-bath", "sb", case=False)

# split_bathroom is a helper that is not defined in this part
gm_regression_df = split_bathroom(gm_regression_df, column='bathrooms_text', text='bath', new_column='bathrooms_new')

# Keep only the first token of each value (the count, or a word such as 'Shared')
gm_regression_df['shared_bath'] = gm_regression_df['shared_bath'].str.split(" ").str[0]
gm_regression_df['private_bath'] = gm_regression_df['private_bath'].str.split(" ").str[0]
gm_regression_df['bathrooms_new'] = gm_regression_df['bathrooms_new'].str.split(" ").str[0]

# Fill missing values with 0
gm_regression_df = gm_regression_df.fillna(0)

# Count a half-bath as 0.5
gm_regression_df['shared_bath'] = gm_regression_df['shared_bath'].replace(to_replace='Shared', value=0.5)
gm_regression_df['private_bath'] = gm_regression_df['private_bath'].replace(to_replace='Private', value=0.5)
gm_regression_df['bathrooms_new'] = gm_regression_df['bathrooms_new'].replace(to_replace='Half-bath', value=0.5)

# Convert to numeric; float rather than int, so that 0.5 is not truncated to 0
gm_regression_df['shared_bath'] = pd.to_numeric(gm_regression_df['shared_bath']).astype(float)
gm_regression_df['private_bath'] = pd.to_numeric(gm_regression_df['private_bath']).astype(float)
gm_regression_df['bathrooms_new'] = pd.to_numeric(gm_regression_df['bathrooms_new']).astype(float)

# Inspect the processed fields
gm_regression_df[['shared_bath', 'private_bath', 'bathrooms_new']].head()
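The chain of string replacements above is easy to get wrong, so here is a minimal alternative sketch that parses one bathroom description at a time with a regular expression. It is not the article's split_bathroom helper: the function name parse_bathroom is hypothetical, and the sample strings are assumed examples of what bathrooms_text might contain.

```python
import re

# Hypothetical helper (not the article's split_bathroom): parse one
# bathrooms_text value into (count, is_shared, is_private).
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_bathroom(text):
    """Return (count, is_shared, is_private) for strings such as
    '1.5 baths', '2 shared baths' or 'Private half-bath'."""
    if not isinstance(text, str) or not text.strip():
        return 0.0, False, False          # treat missing values as 0 baths
    lowered = text.lower()
    match = _NUMBER.search(lowered)
    if match:
        count = float(match.group())
    elif "half-bath" in lowered:
        count = 0.5                       # 'Half-bath' carries no digit
    else:
        count = 0.0
    return count, "shared" in lowered, "private" in lowered


# Assumed sample values, for illustration only
samples = ["1 bath", "1.5 baths", "2 shared baths", "Shared half-bath",
           "1 private bath", "Half-bath", None]
parsed = [parse_bathroom(s) for s in samples]
for raw, row in zip(samples, parsed):
    print(raw, "->", row)
```

On a DataFrame, a function like this could be applied with something like gm_regression_df['bathrooms_text'].apply(parse_bathroom) and the resulting tuples unpacked into separate columns. Whether that matches the columns produced by split_bathroom depends on how that helper is defined.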
Next, we encode the categorical fields. Depending on what each field means, we use either ordinal (label) encoding or one-hot encoding.
# Ordinal encoding
def encoder(df):
    for column in df[['neighbourhood_group_cleansed', 'property_type']].columns:
        # Number the sorted categories 1..n
        labels = df[column].astype('category').cat.categories.tolist()
        replace_map = {column: {k: v for k, v in zip(labels, list(range(1, len(labels) + 1)))}}
        df.replace(replace_map, inplace=True)
        print(replace_map)
    return df

gm_regression_df = encoder(gm_regression_df)
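To make the mapping concrete, here is a small sketch of the same idea on a toy frame. The category values are made up for illustration, and Series.map is used here in place of the in-place DataFrame.replace call in encoder().

```python
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame({"property_type": ["Entire home", "Private room",
                                      "Entire home", "Shared room"]})

# Categories are sorted, then numbered from 1, which mirrors the
# mapping built by encoder() above.
labels = toy["property_type"].astype("category").cat.categories.tolist()
mapping = {label: code for code, label in enumerate(labels, start=1)}

# Series.map returns a new column instead of modifying the frame in place.
toy["property_type_code"] = toy["property_type"].map(mapping)

print(mapping)
print(toy)
```

Note that the resulting numbers follow the sort order of the labels, so the implied ranking is arbitrary. That is usually harmless for tree-based models but can mislead linear ones.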
For the host_response_time and room_type fields, we use one-hot encoding (dummy variables).
host_dummy = pd.get_dummies(gm_regression_df['host_response_time'], prefix='host_response')
room_dummy = pd.get_dummies(gm_regression_df['room_type'], prefix='room_type')
# Concatenate the encoded fields
gm_regression_df = pd.concat([gm_regression_df, host_dummy, room_dummy], axis=1)
# Drop the original fields
gm_regression_df = gm_regression_df.drop(columns=['host_response_time', 'room_type'])

Next, we apply a little more processing to the df_amenities frame prepared earlier and concatenate it onto the feature data.
df_3 = pd.DataFrame(df_amenities.sum())
# Keep the first 150 amenity names as features
features = df_3['amenities'][:150].to_list()
amenities_updated = df_amenities.filter(items=(features))
gm_regression_df = pd.concat([gm_regression_df, amenities_updated], axis=1)

Check the dimensions of the final dataset:
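gm_regression_df.shape
# (3584, 198)

We end up with 198 columns. To avoid multicollinearity among the features, we use the variance inflation factor (VIF) to select features for the machine learning models. Features with a VIF above 10 are removed, because their variance can largely be expressed and explained by the other features in the dataset.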
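from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute the VIF of every feature (the target column is excluded)
vif_model = gm_regression_df.drop(['price'], axis=1)
vif_df = pd.DataFrame()
vif_df['feature'] = vif_model.columns
vif_df['VIF'] = [variance_inflation_factor(vif_model.values, i) for i in range(len(vif_model.columns))]

# Keep the features whose VIF is at most 10
vif_df_new = vif_df[vif_df['VIF'] <= 10]
feature_list = vif_df_new['feature'].to_list()

# Select the data for those features
model_df = gm_regression_df.filter(items=(feature_list))
model_df.head()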
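To show what the VIF measures, here is a minimal NumPy sketch of the textbook definition, VIF = 1 / (1 - R^2), where R^2 comes from regressing one feature on all the others. The function name vif_scores and the synthetic columns are assumptions for illustration. This version adds an intercept to each regression, whereas statsmodels' variance_inflation_factor uses the matrix exactly as passed, so the two can give different numbers unless a constant column is included.

```python
import numpy as np


def vif_scores(X):
    """VIF for each column of X, computed as 1 / (1 - R^2), where R^2 comes
    from regressing that column on all the others plus an intercept.
    Assumes no column is constant."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    scores = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = float(residual @ residual)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r_squared = 1.0 - ss_res / ss_tot
        scores.append(np.inf if r_squared >= 1.0 else 1.0 / (1.0 - r_squared))
    return np.array(scores)


# Synthetic example: column c is nearly a copy of column a
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)
scores = vif_scores(np.column_stack([a, b, c]))
print(scores)
```

In this synthetic setup the two near-duplicate columns get a very large VIF while the independent column stays close to 1, which is the pattern the threshold of 10 is meant to catch.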
We then join the price target column back on to build the complete dataset.
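price_col = gm_regression_df['price']
model_df = model_df.join(price_col)

Machine learning algorithms

We use several typical regression algorithms here: linear regression, random forest regression (RandomForestRegression), Lasso regression, and gradient boosting regression (GradientBoostingRegression).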
For how to apply these machine learning algorithms, see the corresponding ShowMeAI tutorials and articles for a quick, hands-on introduction.
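As a rough sketch of how these four regressors might be fitted and compared, the example below uses scikit-learn on synthetic data. The data stands in for model_df and is not the Airbnb dataset; the train/test split, the R^2 metric, and all hyperparameters are placeholder assumptions, not values from the article.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for model_df: 5 features and a noisy linear target.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.5, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Hyperparameters are placeholders, not tuned values from the article.
models = {
    "LinearRegression": LinearRegression(),
    "Lasso": Lasso(alpha=0.1),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=42),
}

# Fit each model and record its R^2 on the held-out split
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

With the real data, X and y would come from model_df (all feature columns, and the price column respectively), and the ranking of the models may look quite different from this linear toy problem.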
