Introduction
In the vast landscape of the real estate market, understanding the dynamics that influence housing prices is crucial. In this blog post, we embark on a data-driven journey to explore the intricacies of housing price prediction, using advanced regression techniques and ensemble learning. The dataset under scrutiny is the well-known “House Prices: Advanced Regression Techniques” dataset from Kaggle.
Explanation of Random Forest
Random forests are powerful predictive models that allow for data-driven exploration of many explanatory variables when predicting a response or target variable. They provide importance scores for each explanatory variable and let us evaluate how predictive performance changes as the number of trees grows.
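To make this concrete, here is a small generic sketch on synthetic data (not the housing dataset) showing both ideas: per-feature importance scores and how performance changes as more trees are added.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data purely for illustration
X_demo, y_demo = make_regression(n_samples=500, n_features=8, n_informative=4,
                                 noise=10.0, random_state=0)

# How performance changes as the number of trees grows
for n_trees in (10, 50, 200):
    score = cross_val_score(RandomForestRegressor(n_estimators=n_trees, random_state=0),
                            X_demo, y_demo, cv=5).mean()
    print(f"{n_trees:>3} trees -> mean CV R^2: {score:.3f}")

# Importance score for each explanatory variable
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_demo, y_demo)
print("Feature importances:", np.round(forest.feature_importances_, 3))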
Data Preprocessing
Before diving into the Random Forest analysis, it’s essential to preprocess the data. This includes handling missing values, encoding categorical variables, and scaling features to ensure the model performs optimally.
Importing Modules and Data
We kick off by importing essential libraries such as NumPy, Pandas, and Scikit-Learn. The dataset, split into training and testing sets, is loaded into our analysis environment.
# Importing Modules
import numpy as np
import pandas as pd
import sklearn
import scipy
# ... (other module imports)

# Importing training and testing data
train_data = pd.read_csv("/content/train.csv", index_col="Id")
test_data = pd.read_csv("/content/test.csv", index_col="Id")
Explanation of Response and Explanatory Variables
In our analysis, the response variable (dependent variable) is the sale price of the houses, while the explanatory variables (independent variables) include various features such as the size of the house, number of bedrooms, location, etc. These variables were chosen based on their relevance to predicting housing prices.
Data Visualization
Scatter Plot to Check Raw Outliers
Our exploration begins with a scatter plot visualizing the relationship between the ground living area and sale prices. This aids in identifying potential outliers, setting the stage for data cleansing.
fig, ax = plt.subplots(figsize=(10, 6))
ax.grid()
ax.scatter(train_data["GrLivArea"], train_data["SalePrice"], c="#3f72af", zorder=3, alpha=0.9)
ax.axvline(4500, c="#112d4e", ls="--", zorder=2)
ax.set_xlabel("Ground living area (sq. ft)", labelpad=10)
ax.set_ylabel("Sale price ($)", labelpad=10)
Data Cleaning
Removing outliers is a pivotal step in refining the dataset. In this case, we exclude instances where the ground living area exceeds 4450 sq. ft.
train_data = train_data[train_data["GrLivArea"] < 4450]
data = pd.concat([train_data.drop("SalePrice", axis=1), test_data])
Bar Graph to Check Missing Values
Understanding the prevalence of missing values guides our imputation strategy. A bar graph illustrates the number of missing values for each feature.
nans = data.isna().sum().sort_values(ascending=False)
nans = nans[nans > 0]

fig, ax = plt.subplots(figsize=(10, 6))
ax.grid()
ax.bar(nans.index, nans.values, zorder=2, color="#3f72af")
ax.set_ylabel("No. of missing values", labelpad=10)
ax.set_xlim(-0.6, len(nans) - 0.4)
ax.xaxis.set_tick_params(rotation=90)
plt.show()
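The post moves on without showing how these gaps are actually filled. A minimal imputation sketch, assuming the common convention for this dataset (a missing categorical value usually means the feature is absent), might be:

# Hypothetical imputation sketch -- the original post does not show this step.
# Categorical gaps are treated as "feature absent"; numerical gaps get the column median.
for col in data.columns:
    if data[col].dtype == "O":
        data[col] = data[col].fillna("None")
    else:
        data[col] = data[col].fillna(data[col].median())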
Exploring Numerical Variables
We delve into the analysis of numerical features, distinguishing between discrete and continuous variables.
Discrete Values
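The later code refers to numerical_features and discrete_variables, but the post never shows how they are built. A minimal sketch, treating numerical columns with only a few distinct values as discrete (the threshold of 25 and the plotting style are assumptions, mirroring the categorical plots further below), might be:

# Hypothetical sketch: identify numerical features, treat those with few distinct
# values as discrete, and plot their average sale price.
numerical_features = [f for f in train_data.columns
                      if train_data[f].dtype != "O" and f != "SalePrice"]

discrete_variables = [f for f in numerical_features
                      if train_data[f].nunique() < 25]
print(discrete_variables)

for feature in discrete_variables:
    train_data.groupby(feature)["SalePrice"].mean().plot.bar()
    plt.title(feature + " vs Sale Price")
    plt.show()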
Continuous Values
# Continuous Values
continuous_variables = []
for feature in numerical_features:
    if feature not in discrete_variables and feature not in ["YearBuilt", "YearRemodAdd", "GarageYrBlt", "YrSold"]:
        continuous_variables.append(feature)
print(continuous_variables)

for feature in continuous_variables:
    train_data[feature].hist(bins=30)
    plt.title(feature)
    plt.show()
Categorical Variables
# Categorical Values
categorical_features = []
for feature in train_data.columns:
    if train_data[feature].dtype == 'O' and feature != 'SalePrice':
        categorical_features.append(feature)
print(categorical_features)

for feature in categorical_features:
    train_data.groupby(feature)['SalePrice'].mean().plot.bar()
    plt.title(feature + ' vs Sale Price')
    plt.show()
Data Transformation & Feature Scaling
Data Transformation
# Data Transformation
# MSSubClass and YrSold are codes/years that behave like categories, not magnitudes
data[["MSSubClass", "YrSold"]] = data[["MSSubClass", "YrSold"]].astype("category")

# Encode the month sold as a cyclical feature so that December and January end up "close"
data["MoSoldsin"] = np.sin(2 * np.pi * data["MoSold"] / 12)  # sine component
data["MoSoldcos"] = np.cos(2 * np.pi * data["MoSold"] / 12)  # cosine component
data = data.drop("MoSold", axis=1)
Feature Scaling
# Feature Scaling
# RobustScaler centers on the median and scales by the IQR, so outliers have less influence
cols = data.select_dtypes(np.number).columns
data[cols] = RobustScaler().fit_transform(data[cols])
Encoding
# One-hot encode the categorical columns
data = pd.get_dummies(data)
Feature Recovery & Removing Outliers
# Recover the processed training and test sets by index
X_train = data.loc[train_data.index]
X_test = data.loc[test_data.index]
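One piece the post never shows is the target vector y that the searches below fit against. Since the final predictions are converted back with np.exp, the target was presumably the logarithm of the sale price. A plausible sketch:

# Assumed definition of the target -- not shown in the original post.
# The log transform matches the np.exp() applied to the predictions at the end.
y = np.log(train_data["SalePrice"])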
Optimization, Training, and Testing
Hyperparameter Optimization
# Hyper Parameter Optimization
kf = KFold(n_splits=5, random_state=0, shuffle=True)
rmse = lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true, y_pred))
scorer = make_scorer(rmse, greater_is_better=False)

# We use Randomized Search for optimization, since it is more efficient than an exhaustive grid search.
# Define a function which takes a model and a parameter grid, runs a randomized search
# with 5-fold cross-validation scored by RMSE, and returns the fitted search object.
def random_search(model, grid, n_iter=100):
    search = RandomizedSearchCV(estimator=model, param_distributions=grid, scoring=scorer,
                                cv=kf, n_iter=n_iter, n_jobs=4, random_state=0, verbose=True)
    return search.fit(X_train, y)

# Hyperparameter Grids
xgb_hpg = {'n_estimators': [100, 400, 800], 'max_depth': [3, 6, 9], 'learning_rate': [0.05, 0.1, 0.20], 'min_child_weight': [1, 10, 100]}  # XGBoost
ridge_hpg = {"alpha": np.logspace(-1, 2, 500)}  # Ridge Regressor
lasso_hpg = {"alpha": np.logspace(-5, -1, 500)}  # Lasso Regressor
svr_hpg = {"C": np.arange(1, 100), "gamma": np.linspace(0.00001, 0.001, 50), "epsilon": np.linspace(0.01, 0.1, 50)}  # Support Vector Regressor
lgbm_hpg = {"colsample_bytree": np.linspace(0.2, 0.7, 6), "learning_rate": np.logspace(-3, -1, 100)}  # LGBM
gbm_hpg = {"max_features": np.linspace(0.2, 0.7, 6), "learning_rate": np.logspace(-3, -1, 100)}  # Gradient Boost
cat_hpg = {'depth': [2, 9], 'iterations': [10, 30], 'learning_rate': [0.001, 0.1]}  # CatBoost

# Randomized Search for each model
xgb_search = random_search(xgb.XGBRegressor(n_estimators=1000, n_jobs=4), xgb_hpg)  # XGBoost
ridge_search = random_search(Ridge(), ridge_hpg)  # Ridge Regressor
lasso_search = random_search(Lasso(), lasso_hpg)  # Lasso Regressor
svr_search = random_search(SVR(), svr_hpg, n_iter=100)  # Support Vector Regressor
lgbm_search = random_search(LGBMRegressor(n_estimators=2000, max_depth=3), lgbm_hpg, n_iter=100)  # LGBM
gbm_search = random_search(GradientBoostingRegressor(n_estimators=2000, max_depth=3), gbm_hpg, n_iter=100)  # Gradient Boost
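Because the searches use the RMSE scorer, each one exposes its best cross-validated score. A quick comparison of the tuned models (a sketch, assuming the searches above have finished running) might look like:

# Compare cross-validated RMSE of the tuned models (lower is better).
# best_score_ is the negated RMSE because the scorer was built with greater_is_better=False.
searches = {"XGBoost": xgb_search, "Ridge": ridge_search, "Lasso": lasso_search,
            "SVR": svr_search, "LightGBM": lgbm_search, "GradientBoosting": gbm_search}
for name, search in searches.items():
    print(f"{name:18s} CV RMSE: {-search.best_score_:.4f}  best params: {search.best_params_}")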
Random Forest Analysis
Alongside the tuned models above, let's run a Random Forest analysis to predict housing prices. The training and test sets are already split, so we train the model on the training data and judge its performance with cross-validation (the Kaggle test set's sale prices are withheld).
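The post does not show the random forest code itself. A minimal sketch consistent with the preprocessing above, with illustrative rather than tuned hyperparameters, could be:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Illustrative random forest -- the hyperparameters are assumptions, not tuned values.
rf = RandomForestRegressor(n_estimators=500, random_state=0, n_jobs=4)
rf_rmse = -cross_val_score(rf, X_train, y, cv=kf, scoring=scorer)
print("Random Forest CV RMSE:", rf_rmse.mean())

rf.fit(X_train, y)  # refit on the full training data so its importance scores can be inspected

Since the target is the log of the sale price, this RMSE is on the same scale as the competition's evaluation metric.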
Ensemble Learning Model
# Ensemble Learning Model (StackingCVRegressor comes from mlxtend.regressor)
models = [search.best_estimator_ for search in [xgb_search, ridge_search, lasso_search, svr_search, lgbm_search, gbm_search]]  # best estimator from each search
ensemble_search = random_search(StackingCVRegressor(models, Ridge(), cv=kf), {"meta_regressor__alpha": np.logspace(-3, -2, 500)}, n_iter=20)  # tune the Ridge meta-regressor of the stack
models.append(ensemble_search.best_estimator_)  # add the tuned stack to the list of models
Predicting Values & Submission
# Predicting Values & Submissionprediction = [i.predict(X_test) for i in models] # Np array of Predictionspredictions = np.average(prediction, axis=0) # average of all the values# Convert the predictions into the given format, and finally convert them back to normal using the exponential functionmy_prediction = pd.DataFrame({"Id": test_data.index, "SalePrice": np.exp(predictions)}) # given formatmy_prediction.to_csv("E:\Education\Kaggle Projects\House Price - Advanced Regression/my_prediction_ensemble.csv", index=False) # Saving to CSV# Predicting Values & Submission prediction = [i.predict(X_test) for i in models] # Np array of Predictions predictions = np.average(prediction, axis=0) # average of all the values # Convert the predictions into the given format, and finally convert them back to normal using the exponential function my_prediction = pd.DataFrame({"Id": test_data.index, "SalePrice": np.exp(predictions)}) # given format my_prediction.to_csv("E:\Education\Kaggle Projects\House Price - Advanced Regression/my_prediction_ensemble.csv", index=False) # Saving to CSV# Predicting Values & Submission prediction = [i.predict(X_test) for i in models] # Np array of Predictions predictions = np.average(prediction, axis=0) # average of all the values # Convert the predictions into the given format, and finally convert them back to normal using the exponential function my_prediction = pd.DataFrame({"Id": test_data.index, "SalePrice": np.exp(predictions)}) # given format my_prediction.to_csv("E:\Education\Kaggle Projects\House Price - Advanced Regression/my_prediction_ensemble.csv", index=False) # Saving to CSV
Output Interpretation
The cross-validated RMSE obtained for the random forest and the tuned models indicates how well each one predicts (log) housing prices from the given features. We can further analyze the importance scores assigned to each explanatory variable to understand its impact on the prediction.
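For example, any of the tuned tree-based estimators exposes feature_importances_. A short sketch listing the most influential (one-hot-encoded) features from the tuned gradient-boosting model:

# Rank features by the tuned gradient-boosting model's importance scores (a sketch;
# the column names come from the one-hot-encoded design matrix X_train).
importances = pd.Series(gbm_search.best_estimator_.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(15))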
Conclusion
Our data-driven exploration and modelling journey provides valuable insights into predicting housing prices. By leveraging advanced regression techniques and ensemble learning, we navigate through challenges, optimize models, and make predictions that contribute to the dynamic landscape of the housing market.
Click here to view and run the code in Google Colab.
Stay tuned for more data-driven adventures!
Original article: Navigating the Housing Market Storm: A Data-Driven Approach