Save Money on Travel with AI: Accurately Predicting Homestay Listing Prices (Part 2)

  • Machine Learning in Practice | A Complete Guide to Feature Engineering
Field cleaning

The dataset has many fields, and some of them are messy, so we start with some data cleaning. The data includes columns that hold URLs, which contribute little to the final prediction, so we drop them.

    # Drop the URL fields (the list also includes several other columns)
    def drop_function(df):
        df = df.drop(columns=[
            'listing_url', 'description', 'host_thumbnail_url', 'host_picture_url',
            'latitude', 'longitude', 'picture_url', 'host_url', 'host_location',
            'neighbourhood', 'neighbourhood_cleansed', 'host_about',
            'has_availability', 'availability_30', 'availability_60',
            'availability_90', 'availability_365', 'calendar_last_scraped'])
        return df

    gm_df = drop_function(gm_listings)

After the drop, the data is much cleaner.
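Hard-coding the column list is fine for a one-off analysis. As a minimal sketch, and not the author's method, URL columns could also be located by name. The toy DataFrame and the helper name `drop_url_columns` are hypothetical and exist only for illustration.

```python
import pandas as pd


def drop_url_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without columns whose name ends in 'url' (hypothetical helper)."""
    url_cols = [c for c in df.columns if c.endswith("url")]
    return df.drop(columns=url_cols)


# Toy data standing in for the listings table (values are made up).
toy = pd.DataFrame({
    "id": [1, 2],
    "listing_url": ["https://example.com/1", "https://example.com/2"],
    "picture_url": ["https://example.com/a.jpg", "https://example.com/b.jpg"],
    "price": ["$50.00", "$80.00"],
})

cleaned = drop_url_columns(toy)
print(list(cleaned.columns))
```

Because `drop` returns a new DataFrame, the original `toy` frame keeps all of its columns.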
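Handling missing values

The data also contains missing values, which we analyze and handle:

    # Percentage of missing values per column
    (gm_df.isnull().sum() / gm_df.shape[0]) * 100

This gives the following result: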
    id                                                0.000000
    scrape_id                                         0.000000
    last_scraped                                      0.000000
    name                                              0.000000
    neighborhood_overview                            41.266741
    host_id                                           0.000000
    host_name                                         0.000000
    host_since                                        0.000000
    host_response_time                               10.212054
    host_response_rate                               10.212054
    host_acceptance_rate                              5.636161
    host_is_superhost                                 0.000000
    host_neighbourhood                               91.657366
    host_listings_count                               0.000000
    host_total_listings_count                         0.000000
    host_verifications                                0.000000
    host_has_profile_pic                              0.000000
    host_identity_verified                            0.000000
    neighbourhood_group_cleansed                      0.000000
    property_type                                     0.000000
    room_type                                         0.000000
    accommodates                                      0.000000
    bathrooms                                       100.000000
    bathrooms_text                                    0.306920
    bedrooms                                          4.687500
    beds                                              2.120536
    amenities                                         0.000000
    price                                             0.000000
    minimum_nights                                    0.000000
    maximum_nights                                    0.000000
    minimum_minimum_nights                            0.000000
    maximum_minimum_nights                            0.000000
    minimum_maximum_nights                            0.000000
    maximum_maximum_nights                            0.000000
    minimum_nights_avg_ntm                            0.000000
    maximum_nights_avg_ntm                            0.000000
    calendar_updated                                100.000000
    number_of_reviews                                 0.000000
    number_of_reviews_ltm                             0.000000
    number_of_reviews_l30d                            0.000000
    first_review                                     19.810268
    last_review                                      19.810268
    review_scores_rating                             19.810268
    review_scores_accuracy                           20.089286
    review_scores_cleanliness                        20.089286
    review_scores_checkin                            20.089286
    review_scores_communication                      20.089286
    review_scores_location                           20.089286
    review_scores_value                              20.089286
    license                                         100.000000
    instant_bookable                                  0.000000
    calculated_host_listings_count                    0.000000
    calculated_host_listings_count_entire_homes       0.000000
    calculated_host_listings_count_private_rooms      0.000000
    calculated_host_listings_count_shared_rooms       0.000000
    reviews_per_month                                19.810268
    dtype: float64

We handle the missing values differently depending on how large the missing share is:
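    • Fields with a high share of missing values: license, calendar_updated, and bathrooms are 100% NaN, and host_neighbourhood is about 92% NaN. neighborhood_overview is 41% NaN and also holds free text. We drop all of these fields outright.
    • Numeric fields with a modest share of missing values are filled with the column mean. This keeps each column's mean unchanged, although it narrows the spread of the values somewhat. These columns include bedrooms, beds, review_scores_rating, review_scores_accuracy, and the other review-score fields.
    • Categorical fields, such as bathrooms_text and host_response_time, are filled with the mode.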
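The three rules above can also be written as one generic pass over the columns. This is only a sketch under stated assumptions: the 40% drop threshold is an assumption chosen to match the fields the article drops (the author also cites free-text content as a reason for dropping neighborhood_overview), and the helper name `impute_by_rule` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_by_rule(df: pd.DataFrame, drop_threshold: float = 40.0) -> pd.DataFrame:
    """Drop sparse columns, then fill numeric NaNs with the mean and other NaNs with the mode."""
    out = df.copy()
    # Same quantity as isnull().sum() / shape[0] * 100
    missing_pct = out.isna().mean() * 100
    # Assumed threshold: columns with more missing values than this are dropped
    out = out.drop(columns=missing_pct[missing_pct > drop_threshold].index)
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            out[col] = out[col].fillna(out[col].mode()[0])
    return out


# Toy data (made up) covering the three cases
toy = pd.DataFrame({
    "license": [np.nan, np.nan, np.nan, np.nan],               # fully missing, gets dropped
    "beds": [1.0, 2.0, np.nan, 3.0],                           # numeric, filled with the mean
    "bathrooms_text": ["1 bath", None, "1 bath", "2 baths"],   # text, filled with the mode
})

filled = impute_by_rule(toy)
print(filled)
```

Assigning the result of `fillna` back to the column, rather than relying on `inplace=True` on a column selection, keeps the update explicit.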
    # Drop fields with a high share of missing values
    def drop_function_2(df):
        df = df.drop(columns=['license', 'calendar_updated', 'bathrooms',
                              'host_neighbourhood', 'neighborhood_overview'])
        return df

    gm_df = drop_function_2(gm_df)

    # Fill with the mean
    def input_mean(df, column_list):
        for column in column_list:
            # Assign back instead of fillna(inplace=True) on a column selection,
            # which may not update df in recent pandas versions
            df[column] = df[column].fillna(df[column].mean())
        return df

    column_list = ['review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
                   'review_scores_checkin', 'review_scores_communication', 'review_scores_location',
                   'review_scores_value', 'reviews_per_month', 'bedrooms', 'beds']
    gm_df = input_mean(gm_df, column_list)

    # Fill with the mode
    def input_mode(df, column_list):
        for column in column_list:
            df[column] = df[column].fillna(df[column].mode()[0])
        return df

    column_list = ['first_review', 'last_review', 'bathrooms_text', 'host_acceptance_rate',
                   'host_response_rate', 'host_response_time']
    gm_df = input_mode(gm_df, column_list)

Field encoding

Columns such as host_is_superhost and has_availability store true/false as the strings 't' and 'f'; we encode them as 1 and 0. Note that has_availability was already dropped during field cleaning, so its entry in the mapping is redundant.
    # Replace 't' with 1 in the boolean-like columns
    gm_df = gm_df.replace({'host_is_superhost': 't', 'host_has_profile_pic': 't',
                           'host_identity_verified': 't', 'has_availability': 't',
                           'instant_bookable': 't'}, 1)
    # Replace 'f' with 0 in the same columns
    gm_df = gm_df.replace({'host_is_superhost': 'f', 'host_has_profile_pic': 'f',
                           'host_identity_verified': 'f', 'has_availability': 'f',
                           'instant_bookable': 'f'}, 0)

Let's check the value distribution after the replacement:
    gm_df['host_is_superhost'].value_counts()
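An alternative way to express the same encoding is an explicit mapping applied column by column. The sketch below is not the author's code; the helper name `encode_flags`, the `BOOL_MAP` constant, and the toy data are hypothetical. One difference worth knowing: `Series.map` turns any value outside the mapping into NaN, so unexpected values show up as missing instead of staying as strings.

```python
import pandas as pd

BOOL_MAP = {"t": 1, "f": 0}


def encode_flags(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Map 't'/'f' strings to 1/0 in the given columns, skipping columns that are absent."""
    out = df.copy()
    for col in columns:
        if col in out.columns:
            out[col] = out[col].map(BOOL_MAP)
    return out


# Toy data (made up); 'name' is not a flag column and must stay untouched
toy = pd.DataFrame({
    "host_is_superhost": ["t", "f", "f"],
    "instant_bookable": ["f", "t", "t"],
    "name": ["t", "cosy flat", "f"],
})

# has_availability is absent from the toy frame, so it is simply skipped
encoded = encode_flags(toy, ["host_is_superhost", "instant_bookable", "has_availability"])
print(encoded["host_is_superhost"].value_counts())
```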
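Field format conversion

The price-related fields are still stored as strings that contain symbols such as "$". We clean them and convert them to a numeric type.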
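The author's conversion code is not shown in this part of the article. As a hedged sketch of one common approach, the "$" sign and thousands separators can be stripped with a regular expression before converting to float. The helper name `price_to_float` and the sample prices are hypothetical.

```python
import pandas as pd


def price_to_float(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' from price strings and convert the result to float."""
    cleaned = series.str.replace(r"[$,]", "", regex=True)
    # errors="coerce" turns anything that still fails to parse into NaN
    return pd.to_numeric(cleaned, errors="coerce")


# Sample price strings (made up) in the "$1,200.00" style
prices = pd.Series(["$85.00", "$1,200.00", "$40.00"])
numeric = price_to_float(prices)
print(numeric.tolist())
```

Using `errors="coerce"` is a design choice: malformed entries become NaN and can be inspected afterwards instead of raising an exception midway through the column.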
