Mecklenburg County, NC is home to the city of Charlotte and more than a million residents, which makes its real estate market an interesting one to study. In this project, I walk through my process for building a simple model that predicts home sale prices.
This project is based on the R midterm project for MUSA 508 at the University of Pennsylvania; I used the data from that class and developed this version in Python.
# Set up
import pandas as pd
import geopandas as gpd
import numpy as np
from matplotlib import pyplot as plt
import seaborn
from sklearn.neighbors import NearestNeighbors
import warnings
warnings.filterwarnings('ignore')
# For model building
# Linear models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
# Model selection
from sklearn.model_selection import train_test_split
# Pipelines
from sklearn.pipeline import make_pipeline
# Preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import cross_val_score
# For statistical summaries
import statsmodels.api as sm
import statsmodels.formula.api as smf
# For Moran's I and LISA statistics
from libpysal.weights.contiguity import Queen
from esda.moran import Moran
from splot import esda as esdaplot
from pysal.lib import weights
from esda.moran import Moran_Local
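Most of these imports won't appear again until the modeling and spatial-analysis steps later in the post. As a rough preview, here is a minimal sketch of how they are meant to fit together once the cleaned home_sales GeoDataFrame exists; the feature columns are placeholders chosen for illustration, and KNN weights are assumed here (rather than Queen contiguity) because the sales are point geometries.
# Sketch only - a preview of how the imports above will be used later (not the final model)
# numeric / categorical feature lists are placeholders for illustration
numeric_features = ["heatedarea", "yearbuilt"]
categorical_features = ["fireplace_cat", "building_state"]
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features)])
model = make_pipeline(preprocess, LinearRegression())
scores = cross_val_score(model,
                         home_sales[numeric_features + categorical_features],
                         home_sales["price"],
                         cv=5,
                         scoring="neg_mean_absolute_error")
# Spatial autocorrelation of prices: KNN weights, since the sales are points
w = weights.KNN.from_dataframe(home_sales, k=5)
w.transform = "R"
moran = Moran(home_sales["price"], w)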
# read in dataset
raw_data = gpd.read_file("data/studentData.geojson")
Our dataset includes more than 43,000 home sales in Mecklenburg County between 2020 and 2022. First, I'll clean the dataset of missing and irrelevant information.
# Review dataset
raw_data.head()
raw_data.columns
# Pare down columns - select columns to keep and convert to a geographic object
cols_wanted = ["municipali", "dateofsale", "price", "yearbuilt", "heatedarea", "cdebuildin", "descbuildi", "storyheigh", "aheatingty", "heatedfuel", "actype", "extwall", "foundation", "numfirepla", "bldggrade", "fullbaths", "halfbaths", "bedrooms", "units", "shape_Leng", "shape_Area", "sale_year", "toPredict", "geometry"]
cleaning_data = raw_data[cols_wanted]
cleaning_data = gpd.GeoDataFrame(cleaning_data).to_crs("EPSG:4326")
# Remove MUSA challenge houses
#cleaning_data['toPredict'].unique() # see what the options are
cleaning_data = cleaning_data.loc[cleaning_data['toPredict'] == 'MODELLING']
# View pared down dataset
cleaning_data.head()
# Review the datatypes for each column - this can help guide some of our cleaning.
cleaning_data.dtypes
municipali      object
dateofsale      object
price          float64
yearbuilt        int64
heatedarea     float64
cdebuildin      object
descbuildi      object
storyheigh      object
aheatingty      object
heatedfuel      object
actype          object
extwall         object
foundation      object
numfirepla     float64
bldggrade       object
fullbaths        int64
halfbaths        int64
bedrooms       float64
units            int64
shape_Leng     float64
shape_Area     float64
sale_year      float64
toPredict       object
geometry      geometry
dtype: object
Based on the dataset's metadata, we can see what each of the codes means in a few of the more curious columns. From this, I've decided to limit the data to single-family residential homes (cdebuildin = 01).
I'll also convert the date of sale into a real date.
# Clean building codes - metadata here: http://maps.co.mecklenburg.nc.us/opendata/metadata/tax_parcels_with_cama_data.xml
cleaning_data["cdebuildin"].unique()
# How many homes are there in each building-code category?
cleaning_data.groupby('cdebuildin').size()
# The vast majority (45782) of homes are single family homes. Will limit our analysis to just Single Family Residential homes (cdebuildin = 01)
cleaning_data = cleaning_data.loc[cleaning_data['cdebuildin'] == '01']
# How many homes are there in each category within descbuildi?
cleaning_data.groupby('descbuildi').size()
# All are RES - so we can get rid of this feature (later)
# Convert dateofsale to a date
cleaning_data['dateofsale'] = pd.to_datetime(cleaning_data['dateofsale'])
cleaning_data.head()
municipali | dateofsale | price | yearbuilt | heatedarea | cdebuildin | descbuildi | storyheigh | aheatingty | heatedfuel | ... | bldggrade | fullbaths | halfbaths | bedrooms | units | shape_Leng | shape_Area | sale_year | toPredict | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | HUNTERSVILLE | 2021-07-01 | 375000.0 | 1976 | 1596.0 | 01 | RES | 1 STORY | AIR-DUCTED | GAS | ... | AVERAGE | 2 | 0 | 3.0 | 1 | 856.759354 | 36157.191937 | 2021.0 | MODELLING | POINT (-80.93759 35.43227) |
1 | HUNTERSVILLE | 2021-08-13 | 675000.0 | 2010 | 3244.0 | 01 | RES | 2.0 STORY | AIR-DUCTED | GAS | ... | VERY GOOD | 3 | 1 | 4.0 | 1 | 736.049633 | 26070.156589 | 2021.0 | MODELLING | POINT (-80.93604 35.43186) |
2 | HUNTERSVILLE | 2021-10-25 | 3793500.0 | 2008 | 5843.0 | 01 | RES | 2.0 STORY | AIR-DUCTED | GAS | ... | EXCELLENT | 3 | 1 | 4.0 | 1 | 1133.483555 | 41517.942335 | 2021.0 | MODELLING | POINT (-80.93491 35.43386) |
3 | HUNTERSVILLE | 2020-04-29 | 1125000.0 | 1991 | 4272.0 | 01 | RES | 1 STORY | AIR-DUCTED | GAS | ... | VERY GOOD | 4 | 0 | 4.0 | 1 | 908.459599 | 38364.218018 | 2020.0 | MODELLING | POINT (-80.93474 35.43432) |
4 | HUNTERSVILLE | 2020-02-18 | 915000.0 | 2014 | 3542.0 | 01 | RES | 2.0 STORY | AIR-DUCTED | GAS | ... | VERY GOOD | 4 | 1 | 4.0 | 1 | 1074.433466 | 58523.332015 | 2020.0 | MODELLING | POINT (-80.93803 35.43247) |
5 rows × 24 columns
So that we can use story height as a numerical variable, I'll convert the story-height strings into numbers. Since relatively few homes have non-numerical descriptors of building height (e.g. "Split Level") compared to the size of the entire dataset, I'll remove those instances from the dataset.
# will need to convert storyheigh to a numeric variable
cleaning_data.groupby('storyheigh').size()
# remove instances without a specific story height
levels_dont_want = ["BI-LEVEL", "CAPE COD", "RANCH W/BSMT", "SPLIT LEVEL"]
levels_want = ["1 STORY", "1.5 STORY", "2.0 STORY", "2.5 STORY", "3.0 STORY", ">=4.0 STORY"]
cleaning_data = cleaning_data.loc[cleaning_data["storyheigh"].isin(levels_want)]
# Convert story levels to numeric
levels_numeric = {"1 STORY":1, "1.5 STORY":1.5, "2.0 STORY":2, "2.5 STORY":2.5, "3.0 STORY":3, ">=4.0 STORY":4}
cleaning_data["storyheigh"] = cleaning_data["storyheigh"].replace(levels_numeric)
#cleaning_data.groupby("storyheigh").size()
cleaning_data.head()
municipali | dateofsale | price | yearbuilt | heatedarea | cdebuildin | descbuildi | storyheigh | aheatingty | heatedfuel | ... | bldggrade | fullbaths | halfbaths | bedrooms | units | shape_Leng | shape_Area | sale_year | toPredict | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | HUNTERSVILLE | 2021-07-01 | 375000.0 | 1976 | 1596.0 | 01 | RES | 1.0 | AIR-DUCTED | GAS | ... | AVERAGE | 2 | 0 | 3.0 | 1 | 856.759354 | 36157.191937 | 2021.0 | MODELLING | POINT (-80.93759 35.43227) |
1 | HUNTERSVILLE | 2021-08-13 | 675000.0 | 2010 | 3244.0 | 01 | RES | 2.0 | AIR-DUCTED | GAS | ... | VERY GOOD | 3 | 1 | 4.0 | 1 | 736.049633 | 26070.156589 | 2021.0 | MODELLING | POINT (-80.93604 35.43186) |
2 | HUNTERSVILLE | 2021-10-25 | 3793500.0 | 2008 | 5843.0 | 01 | RES | 2.0 | AIR-DUCTED | GAS | ... | EXCELLENT | 3 | 1 | 4.0 | 1 | 1133.483555 | 41517.942335 | 2021.0 | MODELLING | POINT (-80.93491 35.43386) |
3 | HUNTERSVILLE | 2020-04-29 | 1125000.0 | 1991 | 4272.0 | 01 | RES | 1.0 | AIR-DUCTED | GAS | ... | VERY GOOD | 4 | 0 | 4.0 | 1 | 908.459599 | 38364.218018 | 2020.0 | MODELLING | POINT (-80.93474 35.43432) |
4 | HUNTERSVILLE | 2020-02-18 | 915000.0 | 2014 | 3542.0 | 01 | RES | 2.0 | AIR-DUCTED | GAS | ... | VERY GOOD | 4 | 1 | 4.0 | 1 | 1074.433466 | 58523.332015 | 2020.0 | MODELLING | POINT (-80.93803 35.43247) |
5 rows × 24 columns
There were only 444 homes that sold for more than $2 million. Since the market for high-priced homes is very different from the general market, I'm going to remove homes that sold for more than $2 million from the dataset.
# How much variation is there in price?
pd.options.display.float_format = '{:20,.2f}'.format # turn off scientific notation
cleaning_data["price"].describe()
len(cleaning_data["price"].loc[cleaning_data["price"] < 2e6])
# just 444 homes sold for over $2 million. I'm going to limit my prediction to just homes that sold for 2 million or less.
cleaning_data = cleaning_data.loc[cleaning_data["price"] <= 2e6]
# # # Year Built # # #
# Yearbuilt - any cleaning needed?
cleaning_data.groupby("yearbuilt").size()
# Yes - remove the one home with a yearbuilt of 0. Otherwise, the earliest home is from 1897, which is reasonable.
cleaning_data = cleaning_data.loc[cleaning_data["yearbuilt"] != 0]
# # # Heated Area # # #
# Heatedarea - any cleaning needed?
cleaning_data["heatedarea"].describe()
# Yes - there is at least one home with 0 heated area.
len(cleaning_data.loc[cleaning_data["heatedarea"] < 500])
# Three homes with less than 500 sqft. I'll remove them from the dataset.
cleaning_data = cleaning_data.loc[cleaning_data["heatedarea"] >= 500]
# # # Heating Type # # #
# aheatingty- any cleaning needed? Might not use this.
cleaning_data.groupby("aheatingty").size() # most homes are air-ducted. It probably won't help the prediction to include this as a predictor.
# # # Heating Fuel # # #
# heatedfuel - might not use this either
cleaning_data.groupby("heatedfuel").size()
# # # AC Type # # #
# actype - I could engineer this into a binary feature: has AC / no AC. The number of homes without AC is so small compared to the rest of the pool that it may not be worth it.
# But why not.
cleaning_data.groupby("actype").size()
ac_conditions = [(cleaning_data["actype"] == ("AC-NONE")),
(cleaning_data["actype"] != ("AC-NONE"))]
ac_values = ["no_AC", "has_AC"]
cleaning_data["has_ac"] = np.select(ac_conditions, ac_values)
cleaning_data["has_ac"].head()
cleaning_data.groupby("has_ac").size()
# # # Exterior Wall # # #
# extwall - probably won't use
cleaning_data.groupby("extwall", dropna=False).size()
# # # Foundation # # #
#foundation - probably won't use
cleaning_data.groupby("foundation", dropna=False).size()
# # # Fireplaces # # #
# numfirepla - most homes have one fireplace. I will divide this into three categories: no fireplace, one fireplace, or more than one fireplace.
cleaning_data.groupby("numfirepla", dropna=False).size()
fireplace_conditions = [cleaning_data['numfirepla'] == 0,
cleaning_data['numfirepla'] == 1,
cleaning_data['numfirepla'] > 1]
fireplace_categories = ["no_fireplace", "one_fireplace", "2+_fireplaces"]
cleaning_data["fireplace_cat"] = np.select(fireplace_conditions, fireplace_categories)
cleaning_data.groupby("fireplace_cat").size()
# # # Building Grades # # #
#bldggrade - will create two categories: average or better vs below average.
cleaning_data.groupby("bldggrade", dropna=False).size()
good_shape = ["AVERAGE", "CUSTOM", "EXCELLENT", "GOOD", "VERY GOOD"]
bad_shape = ["FAIR", "MINIMUM"]
bldggrade_conditions = [cleaning_data["bldggrade"].isin(good_shape),
cleaning_data["bldggrade"].isin(bad_shape)]
bldggrade_results = ["above_average", "below_average"]
cleaning_data["building_state"] = np.select(bldggrade_conditions, bldggrade_results)
cleaning_data.groupby("building_state").size()
building_state
0                   13
above_average    43059
below_average      664
dtype: int64
# how many null values do we have in each column?
# aheatingty, heatedfuel, actype, and extwall have the most NAs - a good reason not to use them (I was already inclined that way anyway), although the NAs are small in number.
cleaning_data.isnull().sum()
# remove nas
cleaning_data = cleaning_data.dropna()
cleaning_data.isnull().sum()
municipali        0
dateofsale        0
price             0
yearbuilt         0
heatedarea        0
cdebuildin        0
descbuildi        0
storyheigh        0
aheatingty        0
heatedfuel        0
actype            0
extwall           0
foundation        0
numfirepla        0
bldggrade         0
fullbaths         0
halfbaths         0
bedrooms          0
units             0
shape_Leng        0
shape_Area        0
sale_year         0
toPredict         0
geometry          0
has_ac            0
fireplace_cat     0
building_state    0
dtype: int64
# # # Full Baths # # #
# fullbaths - doesn't seem to need any cleaning.
cleaning_data.groupby('fullbaths', dropna=False).size()
# # # Half Baths # # #
# Half baths - appears to be in order
cleaning_data.groupby("halfbaths", dropna=False).size()
# # # Bedrooms # # #
# 'bedrooms' - 0, 50, and 65 bedrooms seem ridiculous; will remove those rows.
cleaning_data.groupby("bedrooms", dropna=False).size()
# keep only homes with a plausible number of bedrooms
cleaning_data = cleaning_data.loc[(cleaning_data['bedrooms'] > 0) & (cleaning_data['bedrooms'] < 50)]
cleaning_data.groupby("bedrooms", dropna=False).size()
# # # Units # # #
# 'units' - I don't know why so many are listed as having 0 units.
# When I compare them to other homes, they appear to be much newer (average year built around 2020, vs the 1990s for 1-unit dwellings) - see the quick check below.
# So this could be a data entry issue. I will opt not to use this feature.
cleaning_data.groupby("units", dropna=False).size()
# # # Sale Year # # #
# 'sale_year' - sale years range from 2020 to 2028. 2028 is clearly a mistake (and just one entry). Will remove that one and keep the others.
# Will also parse sale_year as a date and keep just the year.
cleaning_data.groupby("sale_year").size()
cleaning_data["sale_year"] = cleaning_data["sale_year"].dropna().loc[cleaning_data["sale_year"] < 2028]
cleaning_data["sale_year"] = pd.to_datetime(cleaning_data["sale_year"], format = '%Y').dt.strftime("%Y")
cleaning_data.dropna().groupby("sale_year", dropna=False).size()
sale_year
2020    15610
2021    17525
2022    10349
dtype: int64
After going through each column, there are only a few columns I want to keep. I'll filter the dataset down to just those columns, and make sure that all the missing values (NAs) have been dropped.
# After all that, there are only a few columns I want to keep. They are:
final_cols = ["municipali",
"dateofsale",
"price",
"yearbuilt",
"heatedarea",
"storyheigh",
"has_ac",
"fireplace_cat",
"building_state",
"fullbaths",
"halfbaths",
"bedrooms",
"sale_year",
"shape_Area",
"geometry"]
# Confirm that all NAs have been dropped
cleaning_data = cleaning_data.dropna()
cleaning_data.isnull().sum()
home_sales = cleaning_data[final_cols]
home_sales.head()
municipali | dateofsale | price | yearbuilt | heatedarea | storyheigh | has_ac | fireplace_cat | building_state | fullbaths | halfbaths | bedrooms | sale_year | shape_Area | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | HUNTERSVILLE | 2021-07-01 | 375,000.00 | 1976 | 1,596.00 | 1.00 | has_AC | one_fireplace | above_average | 2 | 0 | 3.00 | 2021 | 36,157.19 | POINT (-80.93759 35.43227) |
1 | HUNTERSVILLE | 2021-08-13 | 675,000.00 | 2010 | 3,244.00 | 2.00 | has_AC | one_fireplace | above_average | 3 | 1 | 4.00 | 2021 | 26,070.16 | POINT (-80.93604 35.43186) |
3 | HUNTERSVILLE | 2020-04-29 | 1,125,000.00 | 1991 | 4,272.00 | 1.00 | has_AC | one_fireplace | above_average | 4 | 0 | 4.00 | 2020 | 38,364.22 | POINT (-80.93474 35.43432) |
4 | HUNTERSVILLE | 2020-02-18 | 915,000.00 | 2014 | 3,542.00 | 2.00 | has_AC | one_fireplace | above_average | 4 | 1 | 4.00 | 2020 | 58,523.33 | POINT (-80.93803 35.43247) |
5 | HUNTERSVILLE | 2020-08-21 | 880,000.00 | 1992 | 3,520.00 | 1.00 | has_AC | one_fireplace | above_average | 3 | 0 | 3.00 | 2020 | 24,435.14 | POINT (-80.93793 35.43427) |
Mapping our selected homes by price shows that homes tend to be priced higher at the northern tip of the county and in the "Wedge" in the south of the city, while homes in the center of the county, and radiating outward from it, are lower priced.
# Map out homes by price
home_sales = gpd.GeoDataFrame(home_sales).to_crs("EPSG:4326")
pd.options.display.float_format = '{:20,.2f}'.format # turn off scientific notation
# Make a map of our selected homes
#Style
plt.style.use("seaborn-v0_8-whitegrid") #plt.style.available
# Plot
fig, ax = plt.subplots(figsize = (10,10))
home_sales.plot(ax = ax,
column = "price",
legend = True,
cmap = "coolwarm",
alpha = 0.3)
ax.set_title("Home Sale Prices, Mecklenburg NC", fontsize = 16)
ax.set_axis_off()
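If the pattern is hard to read with a continuous color ramp, geopandas can also bin the colors into quantiles; this variant assumes the optional mapclassify package is installed.
# Alternative map with quantile color bins (assumes mapclassify is installed)
fig, ax = plt.subplots(figsize = (10,10))
home_sales.plot(ax = ax,
                column = "price",
                scheme = "quantiles",
                k = 5,
                legend = True,
                cmap = "coolwarm",
                alpha = 0.3)
ax.set_title("Home Sale Prices by Quintile, Mecklenburg NC", fontsize = 16)
ax.set_axis_off()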