Introduction
In the vast landscape of the real estate market, understanding the dynamics that influence housing prices is crucial. In this blog post, we embark on a data-driven journey to explore the intricacies of housing price prediction, using advanced regression techniques and ensemble learning. The dataset under scrutiny is the well-known “House Prices: Advanced Regression Techniques” dataset from Kaggle.
Explanation of Random Forest
Random forests are powerful predictive models that allow for data-driven exploration of many explanatory variables when predicting a response or target variable. They provide importance scores for each explanatory variable and let us evaluate how predictive performance changes as the number of trees grows.
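To make this concrete, here is a small generic sketch on synthetic data (not the housing dataset) showing both ideas: per-feature importance scores and how performance changes as more trees are added.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data purely for illustration
X_demo, y_demo = make_regression(n_samples=500, n_features=8, n_informative=4,
                                 noise=10.0, random_state=0)

# How performance changes as the number of trees grows
for n_trees in (10, 50, 200):
    score = cross_val_score(RandomForestRegressor(n_estimators=n_trees, random_state=0),
                            X_demo, y_demo, cv=5).mean()
    print(f"{n_trees:>3} trees -> mean CV R^2: {score:.3f}")

# Importance score for each explanatory variable
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_demo, y_demo)
print("Feature importances:", np.round(forest.feature_importances_, 3))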
Data Preprocessing
Before diving into the Random Forest analysis, it’s essential to preprocess the data. This includes handling missing values, encoding categorical variables, and scaling features to ensure the model performs optimally.
Importing Modules and Data
We kick off by importing essential libraries such as NumPy, Pandas, and Scikit-Learn. The dataset, split into training and testing sets, is loaded into our analysis environment.
# Importing Modules
import numpy as np
import pandas as pd
import sklearn
import scipy
# ... (other module imports)

# Importing training and testing data
train_data = pd.read_csv("/content/train.csv", index_col="Id")
test_data = pd.read_csv("/content/test.csv", index_col="Id")
Explanation of Response and Explanatory Variables
In our analysis, the response variable (dependent variable) is the sale price of the houses, while the explanatory variables (independent variables) include various features such as the size of the house, number of bedrooms, location, etc. These variables were chosen based on their relevance to predicting housing prices.
Data Visualization
Scatter Plot to Check Raw Outliers
Our exploration begins with a scatter plot visualizing the relationship between the ground living area and sale prices. This aids in identifying potential outliers, setting the stage for data cleansing.
fig, ax = plt.subplots(figsize=(10, 6))
ax.grid()
ax.scatter(train_data["GrLivArea"], train_data["SalePrice"], c="#3f72af", zorder=3, alpha=0.9)
ax.axvline(4500, c="#112d4e", ls="--", zorder=2)
ax.set_xlabel("Ground living area (sq. ft)", labelpad=10)
ax.set_ylabel("Sale price ($)", labelpad=10)
Data Cleaning
Removing outliers is a pivotal step in refining the dataset. In this case, we exclude instances where the ground living area exceeds 4450 sq. ft.
train_data = train_data[train_data["GrLivArea"] < 4450]
data = pd.concat([train_data.drop("SalePrice", axis=1), test_data])
Bar Graph to Check Missing Values
Understanding the prevalence of missing values guides our imputation strategy. A bar graph illustrates the number of missing values for each feature.
nans = data.isna().sum().sort_values(ascending=False)
nans = nans[nans > 0]

fig, ax = plt.subplots(figsize=(10, 6))
ax.grid()
ax.bar(nans.index, nans.values, zorder=2, color="#3f72af")
ax.set_ylabel("No. of missing values", labelpad=10)
ax.set_xlim(-0.6, len(nans) - 0.4)
ax.xaxis.set_tick_params(rotation=90)
plt.show()
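The post moves on without showing how these gaps are actually filled. A minimal imputation sketch, assuming the common convention for this dataset (a missing categorical value usually means the feature is absent), might be:

# Hypothetical imputation sketch -- the original post does not show this step.
# Categorical gaps are treated as "feature absent"; numerical gaps get the column median.
for col in data.columns:
    if data[col].dtype == "O":
        data[col] = data[col].fillna("None")
    else:
        data[col] = data[col].fillna(data[col].median())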
Exploring Numerical Variables
We delve into the analysis of numerical features, distinguishing between discrete and continuous variables.
Discrete Values
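The later code refers to numerical_features and discrete_variables, but the post never shows how they are built. A minimal sketch, treating numerical columns with only a few distinct values as discrete (the threshold of 25 and the plotting style are assumptions, mirroring the categorical plots further below), might be:

# Hypothetical sketch: identify numerical features, treat those with few distinct
# values as discrete, and plot their average sale price.
numerical_features = [f for f in train_data.columns
                      if train_data[f].dtype != "O" and f != "SalePrice"]

discrete_variables = [f for f in numerical_features
                      if train_data[f].nunique() < 25]
print(discrete_variables)

for feature in discrete_variables:
    train_data.groupby(feature)["SalePrice"].mean().plot.bar()
    plt.title(feature + " vs Sale Price")
    plt.show()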
Continuous Values
# Continuous Values
continuous_variables = []
for feature in numerical_features:
    if feature not in discrete_variables and feature not in ["YearBuilt", "YearRemodAdd", "GarageYrBlt", "YrSold"]:
        continuous_variables.append(feature)
print(continuous_variables)

for feature in continuous_variables:
    train_data[feature].hist(bins=30)
    plt.title(feature)
    plt.show()
Categorical Variables
# Categorical Values
categorical_features = []
for feature in train_data.columns:
    if train_data[feature].dtype == 'O' and feature != 'SalePrice':
        categorical_features.append(feature)
print(categorical_features)

for feature in categorical_features:
    train_data.groupby(feature)['SalePrice'].mean().plot.bar()
    plt.title(feature + ' vs Sale Price')
    plt.show()
Data Transformation & Feature Scaling
Data Transformation
# Data Transformation
# MSSubClass and YrSold are codes/years that behave like categories, not magnitudes
data[["MSSubClass", "YrSold"]] = data[["MSSubClass", "YrSold"]].astype("category")

# Encode the month sold as a cyclical feature so that December and January end up "close"
data["MoSoldsin"] = np.sin(2 * np.pi * data["MoSold"] / 12)  # sine component
data["MoSoldcos"] = np.cos(2 * np.pi * data["MoSold"] / 12)  # cosine component
data = data.drop("MoSold", axis=1)
Feature Scaling
# Feature Scaling
# RobustScaler centers on the median and scales by the IQR, so outliers have less influence
cols = data.select_dtypes(np.number).columns
data[cols] = RobustScaler().fit_transform(data[cols])
Encoding
# One-hot encode the categorical columns
data = pd.get_dummies(data)
Feature Recovery & Removing Outliers
# Recover the processed training and test sets by index
X_train = data.loc[train_data.index]
X_test = data.loc[test_data.index]
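One piece the post never shows is the target vector y that the searches below fit against. Since the final predictions are converted back with np.exp, the target was presumably the logarithm of the sale price. A plausible sketch:

# Assumed definition of the target -- not shown in the original post.
# The log transform matches the np.exp() applied to the predictions at the end.
y = np.log(train_data["SalePrice"])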
Optimization, Training, and Testing
Hyperparameter Optimization
# Hyper Parameter Optimization
kf = KFold(n_splits=5, random_state=0, shuffle=True)
rmse = lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true, y_pred))
scorer = make_scorer(rmse, greater_is_better=False)

# We use Randomized Search for optimization, since it is more efficient than an exhaustive grid search.
# Define a function which takes a model and a parameter grid, runs a randomized search
# with 5-fold cross-validation scored by RMSE, and returns the fitted search object.
def random_search(model, grid, n_iter=100):
    search = RandomizedSearchCV(estimator=model, param_distributions=grid, scoring=scorer,
                                cv=kf, n_iter=n_iter, n_jobs=4, random_state=0, verbose=True)
    return search.fit(X_train, y)

# Hyperparameter Grids
xgb_hpg = {'n_estimators': [100, 400, 800], 'max_depth': [3, 6, 9], 'learning_rate': [0.05, 0.1, 0.20], 'min_child_weight': [1, 10, 100]}  # XGBoost
ridge_hpg = {"alpha": np.logspace(-1, 2, 500)}  # Ridge Regressor
lasso_hpg = {"alpha": np.logspace(-5, -1, 500)}  # Lasso Regressor
svr_hpg = {"C": np.arange(1, 100), "gamma": np.linspace(0.00001, 0.001, 50), "epsilon": np.linspace(0.01, 0.1, 50)}  # Support Vector Regressor
lgbm_hpg = {"colsample_bytree": np.linspace(0.2, 0.7, 6), "learning_rate": np.logspace(-3, -1, 100)}  # LGBM
gbm_hpg = {"max_features": np.linspace(0.2, 0.7, 6), "learning_rate": np.logspace(-3, -1, 100)}  # Gradient Boost
cat_hpg = {'depth': [2, 9], 'iterations': [10, 30], 'learning_rate': [0.001, 0.1]}  # CatBoost

# Randomized Search for each model
xgb_search = random_search(xgb.XGBRegressor(n_estimators=1000, n_jobs=4), xgb_hpg)  # XGBoost
ridge_search = random_search(Ridge(), ridge_hpg)  # Ridge Regressor
lasso_search = random_search(Lasso(), lasso_hpg)  # Lasso Regressor
svr_search = random_search(SVR(), svr_hpg, n_iter=100)  # Support Vector Regressor
lgbm_search = random_search(LGBMRegressor(n_estimators=2000, max_depth=3), lgbm_hpg, n_iter=100)  # LGBM
gbm_search = random_search(GradientBoostingRegressor(n_estimators=2000, max_depth=3), gbm_hpg, n_iter=100)  # Gradient Boost
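Because the searches use the RMSE scorer, each one exposes its best cross-validated score. A quick comparison of the tuned models (a sketch, assuming the searches above have finished running) might look like:

# Compare cross-validated RMSE of the tuned models (lower is better).
# best_score_ is the negated RMSE because the scorer was built with greater_is_better=False.
searches = {"XGBoost": xgb_search, "Ridge": ridge_search, "Lasso": lasso_search,
            "SVR": svr_search, "LightGBM": lgbm_search, "GradientBoosting": gbm_search}
for name, search in searches.items():
    print(f"{name:18s} CV RMSE: {-search.best_score_:.4f}  best params: {search.best_params_}")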
Random Forest Analysis
Alongside the tuned models above, let's run a Random Forest analysis to predict housing prices. The training and test sets are already split, so we train the model on the training data and judge its performance with cross-validation (the Kaggle test set's sale prices are withheld).
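The post does not show the random forest code itself. A minimal sketch consistent with the preprocessing above, with illustrative rather than tuned hyperparameters, could be:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Illustrative random forest -- the hyperparameters are assumptions, not tuned values.
rf = RandomForestRegressor(n_estimators=500, random_state=0, n_jobs=4)
rf_rmse = -cross_val_score(rf, X_train, y, cv=kf, scoring=scorer)
print("Random Forest CV RMSE:", rf_rmse.mean())

rf.fit(X_train, y)  # refit on the full training data so its importance scores can be inspected

Since the target is the log of the sale price, this RMSE is on the same scale as the competition's evaluation metric.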
Ensemble Learning Model
# Ensemble Learning Model (StackingCVRegressor comes from mlxtend.regressor)
models = [search.best_estimator_ for search in [xgb_search, ridge_search, lasso_search, svr_search, lgbm_search, gbm_search]]  # best estimator from each search
ensemble_search = random_search(StackingCVRegressor(models, Ridge(), cv=kf), {"meta_regressor__alpha": np.logspace(-3, -2, 500)}, n_iter=20)  # tune the Ridge meta-regressor of the stack
models.append(ensemble_search.best_estimator_)  # add the tuned stack to the list of models
Predicting Values & Submission
# Predicting Values & Submissionprediction = [i.predict(X_test) for i in models] # Np array of Predictionspredictions = np.average(prediction, axis=0) # average of all the values# Convert the predictions into the given format, and finally convert them back to normal using the exponential functionmy_prediction = pd.DataFrame({"Id": test_data.index, "SalePrice": np.exp(predictions)}) # given formatmy_prediction.to_csv("E:\Education\Kaggle Projects\House Price - Advanced Regression/my_prediction_ensemble.csv", index=False) # Saving to CSV# Predicting Values & Submission prediction = [i.predict(X_test) for i in models] # Np array of Predictions predictions = np.average(prediction, axis=0) # average of all the values # Convert the predictions into the given format, and finally convert them back to normal using the exponential function my_prediction = pd.DataFrame({"Id": test_data.index, "SalePrice": np.exp(predictions)}) # given format my_prediction.to_csv("E:\Education\Kaggle Projects\House Price - Advanced Regression/my_prediction_ensemble.csv", index=False) # Saving to CSV# Predicting Values & Submission prediction = [i.predict(X_test) for i in models] # Np array of Predictions predictions = np.average(prediction, axis=0) # average of all the values # Convert the predictions into the given format, and finally convert them back to normal using the exponential function my_prediction = pd.DataFrame({"Id": test_data.index, "SalePrice": np.exp(predictions)}) # given format my_prediction.to_csv("E:\Education\Kaggle Projects\House Price - Advanced Regression/my_prediction_ensemble.csv", index=False) # Saving to CSV
Output Interpretation
The cross-validated RMSE obtained for the random forest and the tuned models indicates how well each one predicts (log) housing prices from the given features. We can further analyze the importance scores assigned to each explanatory variable to understand its impact on the prediction.
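For example, any of the tuned tree-based estimators exposes feature_importances_. A short sketch listing the most influential (one-hot-encoded) features from the tuned gradient-boosting model:

# Rank features by the tuned gradient-boosting model's importance scores (a sketch;
# the column names come from the one-hot-encoded design matrix X_train).
importances = pd.Series(gbm_search.best_estimator_.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(15))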
Conclusion
Our data-driven exploration and modelling journey provides valuable insights into predicting housing prices. By leveraging advanced regression techniques and ensemble learning, we navigate through challenges, optimize models, and make predictions that contribute to the dynamic landscape of the housing market.
Click here to view and run the code in Google Colab.
Stay tuned for more data-driven adventures!
Original article: Navigating the Housing Market Storm: A Data-Driven Approach