Feature Engineered Regression Analysis on Airbnb Short Term Rental Market

Introduction

In 2007, a familiar tale unfolded in a modest apartment: two broke, recently graduated young professionals brainstormed ways to cover their exorbitant San Francisco rent. Their solution, which seemed so simple at the time, was to place an air mattress in their living room and advertise the space online for rent as an “Air Bed and Breakfast.” Little did these two roommates know that this concept would eventually evolve into the massive publicly-traded company known as Airbnb today, valued at over $100 billion.

Fundamentally, Airbnb operates as an online marketplace that connects people who would like to rent out their properties and living spaces with people who are looking for accommodations in specific locales. Airbnb has carved out a notable share of the short-term rental space since its inception. With over 7 million listings across 100,000 cities, Airbnb has revolutionized the hospitality industry, providing hosts with an opportunity to earn substantial income. As of December 2023, there are 5 million hosts who have collectively earned over $250 billion, with the typical host earning $14,000 per year. With its popularity only growing, listing homes on Airbnb has become a lucrative investment opportunity, and many owners choose to convert their homes to full-time Airbnbs rather than traditional long-term rentals.

However, like any business venture, owning and operating an Airbnb carries risks, and hosts must remain knowledgeable and vigilant to ensure their property is profitable. Effective pricing of a listing is crucial, as it maximizes profits and minimizes risk. This is where we focus our analysis for this project: we would like to help inform Airbnb hosts, potentially including ourselves one day, on how to price their Airbnb, why they should price it this way, and how to increase the price, and therefore the profitability, of their Airbnb.

We achieve this by analyzing the attributes of Airbnb properties, such as the number of bathrooms, to determine their impact on pricing. Additionally, we explore how the significance of these attributes varies across different cities. Ultimately, we develop an algorithm capable of accurately pricing Airbnb properties based on their attributes. We take an additional step by evaluating houses currently on the housing market to identify worthwhile investment opportunities based on our estimated Airbnb listing prices.

Dataset

The data for the project came from two different sources: the website Inside Airbnb, https://insideairbnb.com/, and an Application Programming Interface (API) provided by Scrapeak, https://www.scrapeak.com/. Scrapeak provides data about Zillow (Scrapeak Docs), while Inside Airbnb provides data about Airbnb (Get the Data).

Inside Airbnb is a service that hosts Comma Separated Value (CSV) and GeoJSON files with information about short-term rentals managed through Airbnb for major metropolitan areas across the world. Inside Airbnb scrapes information once a quarter for each metropolitan area it follows, and each file contains only the data from a single quarterly scrape. Historical data is available going back at most 12 months, which usually means the most recent quarter plus the two, or sometimes three, previous quarters.

The files they host for each metropolitan area are:

listings.csv.gz - Detailed information for each listing, compressed with gzip.
calendar.csv.gz - The availability and cost for each night for every listing, compressed with gzip.
reviews.csv.gz - Detailed information about every review for each listing, compressed with gzip.
listings.csv - A summary of the information in listings.csv.gz
reviews.csv - A summary of the information in reviews.csv.gz
neighbourhoods.csv - A list of neighborhoods from the metropolitan area
neighbourhoods.geojson - Shape files for the metropolitan area and its neighborhoods.

The majority of our analysis focused on the listings.csv.gz files for Washington D.C. and other major metropolitan areas. Each file contains information about the listing, including URL, name, description, location, type of domicile, number of bedrooms, number of bathrooms, amenities, price, rating, and number of days it is available. It also contains information about the host (the person managing the property) and the neighborhood the property is in. In total, Inside Airbnb provides seventy-five different features in the listings.csv.gz files.
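As a minimal sketch of how such a file can be read, the snippet below loads a gzip-compressed CSV using only the Python standard library. The function name read_listings is hypothetical and is not taken from the project code; column values are returned as strings, so a field such as price still needs to be converted before analysis.

```python
import csv
import gzip


def read_listings(path):
    """Read a gzip-compressed CSV into a list of dicts, one per listing."""
    # "rt" opens the gzip stream in text mode; newline="" lets the csv module
    # handle line breaks embedded in quoted fields such as descriptions.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))
```

Each returned dict maps a column header to the raw string value for one listing.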

We also looked at data collected using an API provided by Scrapeak. Like Inside Airbnb, Scrapeak is not the originator of the data, but unlike Inside Airbnb, Scrapeak does not host any data. The Scrapeak API has endpoints to pull details about a particular listing or information for multiple listings using search results or inputs from Zillow. We focused on the listing endpoint, which takes the URL from a search on zillow.com and returns the information for every listing in the search as JavaScript Object Notation (JSON). The Python code below pulls all home sale listings on Zillow inside Washington D.C.

The API_KEY was obtained by registering on the Scrapeak website and was saved in the file ../keys/scrape. Listing_url was obtained by setting the search parameters on zillow.com and then copying the URL of the results page.

The JSON contains 92 different features for each listing in the results, including location, price, number of bedrooms, number of bathrooms, square footage, and neighborhood, among many more.

Data Cleaning

With all the data collected, we then turned our efforts to cleaning and manipulating the data. Data cleaning is essential not only for creating an accurate model, but also for the rest of our analysis. Without a properly cleaned dataset, all our methods and procedures would be useless. Fortunately, the .csv files we retrieved from Inside Airbnb, including our main dataset (listings.csv), were fairly clean to begin with. However, there were still some problems we needed to address and changes we needed to make for our use case.

As with most data sets used in machine learning models, ours contained some outliers that we needed to transform or remove. Outliers can greatly reduce model performance because they exert disproportionate influence on a model’s coefficient estimates: the model fits the outliers and the noise instead of learning the underlying patterns in the data, which is essentially overfitting. In our dataset, there were some massive Airbnbs with very high prices that would have negatively impacted our statistical methods. We felt it was suitable to simply remove these outliers, since our model would still be useful for an overwhelming majority of Airbnb hosts. We applied the interquartile range (IQR) method to the price column to achieve this.
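The IQR filter can be sketched as follows. This is an illustration rather than the project's code: it assumes the conventional 1.5 multiplier on the IQR (the exact multiplier used is not stated), that prices have already been converted to numbers, and the function names are hypothetical.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_price_outliers(listings, k=1.5):
    """Keep only the listings whose numeric price lies inside the fences."""
    lower, upper = iqr_bounds([row["price"] for row in listings], k)
    return [row for row in listings if lower <= row["price"] <= upper]
```

A listing priced far above the upper fence is dropped, while the bulk of the listings is left untouched.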

There were also some smaller issues with the dataset that we needed to remedy. The bedrooms column, which was supposed to contain the integer number of bedrooms for each listing, was blank, as was the bathrooms column. So, we extracted the number of bedrooms from the name column and the number of bathrooms from the bathrooms_text column using regular expressions. Furthermore, we created a host_length column (in days) from the host_since column, and filtered the data to keep only listings where the entire property was available to guests. Restricting the data to entire properties is intended to improve prediction and to make application to, and comparison with, the Zillow data easier.
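A minimal sketch of these extraction steps is shown below. The regular expressions, the function names, and the example text formats (a name containing "2 bedrooms", a bathrooms_text such as "1.5 baths" or "Half-bath", and an ISO-formatted host_since date) are assumptions made for illustration; the patterns actually used in the project may differ.

```python
import re
from datetime import date

BEDROOMS_RE = re.compile(r"(\d+)\s+bedrooms?", re.IGNORECASE)
NUMBER_RE = re.compile(r"(\d+(?:\.\d+)?)")


def parse_bedrooms(name):
    """Return the bedroom count found in a listing name, or None."""
    match = BEDROOMS_RE.search(name)
    return int(match.group(1)) if match else None


def parse_bathrooms(bathrooms_text):
    """Return the bathroom count as a float, or None if it cannot be read."""
    match = NUMBER_RE.search(bathrooms_text)
    if match:
        return float(match.group(1))
    # Entries such as "Half-bath" carry no digit; treat them as 0.5.
    return 0.5 if "half" in bathrooms_text.lower() else None


def host_length_days(host_since, as_of):
    """Days between the host's join date (ISO string) and a reference date."""
    return (as_of - date.fromisoformat(host_since)).days
```

Returning None for unparseable values keeps those rows easy to identify when rows with null values are removed later.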

Lastly, we conducted standard preprocessing steps on our data before modeling. This included converting date columns to date format, removing rows with null values, scaling numerical features to a standard scale, and one-hot encoding categorical variables.
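Two of these steps, scaling and one-hot encoding, can be sketched with the standard library alone. The function names are hypothetical, and the project may well have relied on a library implementation instead; the sketch only shows what each transformation does.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # must be non-zero, i.e. not a constant column
    return [(v - mean) / std for v in values]


def one_hot(values):
    """Return (categories, rows); each row is a list of 0/1 indicators."""
    categories = sorted(set(values))
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows
```

For example, a column holding the values "t" and "f" becomes two indicator columns, one per category.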

Model Comparison

From the many available columns, we chose 18 to predict Airbnb price in D.C.: longitude, latitude, accommodates, bathrooms, beds, number of reviews, reviews per month, rating score, accuracy score, cleanliness score, check-in score, communication score, location score, value score, host profile picture, host identity verification, host length, and instant bookability.

Among those variables, accommodates, beds, and bathrooms are the most strongly correlated with one another, with correlations of 0.85 (accommodates & beds), 0.71 (accommodates & bathrooms), and 0.66 (beds & bathrooms). This is expected, as more bathrooms and beds imply more possible guests in the unit. Those three variables also have a relatively high correlation with price. Apart from those three, longitude and latitude have a negative correlation of -0.46.

We used five different models to predict the price: LASSO, Linear Regression, Generalized Additive Model (GAM), Random Forest, and Support Vector Regression (SVR). The models were chosen intentionally to include both parametric and non-parametric approaches. We calculated Mean Squared Error (MSE) to compare their prediction performance. Because price is right-skewed, we log-transformed it before modeling, and all other numerical columns were normalized as well.
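The target transformation and the evaluation metrics can be written out as a short sketch. It assumes the natural logarithm for the price transformation (the base is not stated) and uses the standard definitions of MSE and R-squared; the function names are hypothetical.

```python
import math


def log_price(prices):
    """Natural-log transform of strictly positive prices."""
    return [math.log(p) for p in prices]


def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 minus the ratio of residual to total sum of squares."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot
```

Because the models are fitted to the transformed price, the MSE values in this section are on the log scale rather than in dollars.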

The parametric models produce similar results in both their significant features and their test MSE. Longitude, latitude, accommodates, bathrooms, beds, rating score, location score, reviews per month, and instant bookability are significant variables in both LASSO and Linear Regression. Using those features, LASSO regression has a test MSE of 0.1296 on price prediction with an R-squared value of 0.54.

For Linear Regression, the base model has all of those variables, along with value score, host length, and host identity verification, as significant. This base model also has a test MSE of 0.1296, with an R-squared value of 0.488. We then tried to account for the correlations we saw above. When interaction terms among accommodates, bathrooms, and beds are added, the interaction terms are significant and the test MSE is slightly lower.

	Beds & Accommodates	Beds & Bathrooms	Bathrooms & Accommodates	Longitude & Latitude
Train MSE	0.1379	0.1372	0.1363	0.1385
Test MSE	0.1271	0.1271	0.1269	0.1291
R-squared	0.4947	0.4972	0.5006	0.4925

The model with an interaction between bathrooms and accommodates has the lowest test MSE of 0.1269 and the model with an interaction between latitude and longitude has the highest test MSE of 0.1291.
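For reference, a linear model with a bathrooms and accommodates interaction can be written out explicitly. The form below is a sketch under the assumption that the interaction enters as a simple product term added to the base regression on log price; the coefficient symbols are introduced here for illustration only.

```latex
\log(\text{price}) = \beta_0
  + \beta_1\,\text{bathrooms}
  + \beta_2\,\text{accommodates}
  + \beta_3\,(\text{bathrooms} \times \text{accommodates})
  + \sum_{j} \gamma_j x_j
  + \varepsilon
```

Here the x_j stand for the remaining predictors. A significant \beta_3 means that the price effect of an additional bathroom depends on how many guests the listing accommodates.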

The non-parametric models perform better than the parametric models. The Random Forest model performs the best of all models in terms of test MSE. The five most important variables in the model, ranked by the variance they explain, are bathrooms, accommodates, beds, longitude, and latitude. Using all features, the model has a test MSE of 0.1096.

Similarly, SVR, another non-parametric model, performs slightly better than the parametric models. It uses a radial kernel with C = 1. The test MSE was 0.1198, with an R-squared value of 0.575. SVR models with linear and polynomial kernels and the same C value have larger test MSEs of 0.1304 and 0.2692, respectively.

Finally, the GAM, which is a semi-parametric model, performs similarly to the non-parametric models. We used smoothing terms for all numerical variables. Latitude, accommodates, bathrooms, beds, number of reviews, rating score, value score, reviews per month, and host length are significant for both parametric and non-parametric effects, whereas longitude and location score are significant only for parametric effects. The GAM has a test MSE of 0.1193, with an R-squared value of 0.577.

	LASSO	Linear	Random Forest	SVR	GAM
Test MSE	0.1296	0.1269	0.1096	0.1198	0.1193
R-squared	0.5423	0.5005	0.6115	0.5753	0.5772

As the table highlights, the Random Forest model has the lowest test MSE and the highest R-squared value. The differences in test MSE and R-squared are not large, but they are enough to identify Random Forest as the best model. Overall, the R-squared values are not high, but they are suitable for comparing the models because all models are fitted using the same number of features.

Through this model comparison, we found that the selected features are relatively similar across all models. Longitude, latitude, accommodates, beds, bathrooms, and reviews per month are used in all models as important features to predict the log-transformed price. Only the parametric models have instant bookability as a significant feature. Similarly, only the GAM has number of reviews as a significant feature.

Another finding is that the non-parametric models generally performed better than the parametric models. This suggests that the relationships between price and the other features are not fully captured by the fixed functional forms that the parametric models assume.

City Comparison

In our study, we identified the Random Forest algorithm as the best performer among the various models tested. To deepen our analysis, we implemented this model architecture across different cities, enabling us to extract and compare key features as well as assess model accuracy. The specific parameters of the Random Forest regressor model used in our experiments are detailed as follows:

Parameters	Values
n_estimators	200
max_depth	11
min_samples_split	5
min_samples_leaf	3

In this research, we evaluated the performance of a Random Forest model across multiple cities, including Athens, Bangkok, Barcelona, Boston, Hawaii, Los Angeles, Mexico City, New York, Paris, Rome, Seattle, Singapore, and Washington DC. We measured the model’s accuracy by calculating the Mean Squared Error (MSE) and R-squared (R²) for each city’s validation dataset. MSE values ranged from 24.3 in Athens to 140.4 in Mexico City. Despite tuning the hyperparameters to prevent overfitting, there was a noticeable gap between performance on the training and test datasets, although the size of this gap varied by city.

For R-squared, results for the testing dataset varied from 0.3 to 0.6 depending on the city. Beyond traditional metrics such as R-squared and MSE, we conducted further analyses by plotting both the predicted and actual prices on a histogram to assess the accuracy of the model in capturing the distribution of prices. The histograms indicated that the model generally captured the distribution of prices well, although there were challenges in accurately predicting lower values in cities where the actual price distributions were right-skewed.

The contribution of individual features to the model’s predictive power also varied by city, suggesting localized patterns that may influence Airbnb pricing. Key features influencing price predictions differed across cities, revealing potential areas for further interpretation and application. A summary of the most influential features for selected cities is presented in the table below.

Rank	Singapore	Hawaii	Rome	Washington DC
1	host since (0.20)	bathrooms (0.34)	bathrooms (0.28)	bathrooms (0.28)
2	minimum nights (0.10)	latitude (0.13)	Centro Storico (0.09)	accommodates (0.09)
3	beds (0.09)	accommodates (0.05)	longitude (0.07)	latitude (0.06)
4	longitude (0.07)	longitude (0.05)	number of reviews (0.06)	longitude (0.06)
5	accommodates (0.07)	Lahaina (0.04)	location score (0.06)	location score (0.06)
6	latitude (0.06)	minimum nights (0.03)	accommodates (0.05)	reviews per month (0.04)

In our analysis of Airbnb pricing across multiple cities, we found that the number of bathrooms consistently emerged as a significant predictor of pricing, often having a more substantial impact than the number of guests a listing accommodates. This finding suggests that potential guests may prioritize the availability of sufficient bathroom facilities, which may allow for more guests than the official accommodation capacity suggests, thus reflecting a preference for convenience and comfort in lodging arrangements.

Beyond the ‘bathrooms’ and ‘accommodates’ features, the interpretation of influential factors requires an understanding of each city’s unique context and cultural nuances. For instance, in Washington DC, geographical coordinates and location scores were among the top predictive features. This indicates the importance of specific neighborhoods for tourists, with areas in the Northwest generally preferred over those in the Southeast. This preference may be driven by the proximity to key government buildings, museums, and cultural events that define the appeal of the nation’s capital.

Conversely, in Rome, the ‘Centro Storico’ feature—a one-hot encoded variable indicating whether an Airbnb is located in this historic central district—highlighted the significance of specific neighborhoods. Being situated in ‘Centro Storico’ is highly valued, likely due to its proximity to major historical sites and cultural amenities, making it a prime location for tourists seeking to immerse themselves in the rich history and vibrant life of Rome.

In Singapore, a distinct pattern emerged where the ‘host since’ feature, which calculates the duration since the host started on Airbnb, was the most significant. This pattern could be attributed to the city’s compact geography, excellent public transportation, uniformly high public safety, and dispersed tourist attractions, which diminish the importance of location compared to other cities. Instead, the experience and reliability of the host appear to be more crucial, suggesting that in well-facilitated urban environments, the reputation and service quality of hosts become significant determinants of pricing.

These findings indicate that while it is feasible to predict Airbnb pricing based on specific attributes, the factors influencing prices vary significantly from city to city, reflecting local preferences and conditions. This variability emphasizes the importance of localized knowledge when applying predictive models in real estate and hospitality contexts. Understanding these local dynamics can enable hosts to optimize their listings to better meet market demands and assist policymakers in crafting regulations that reflect the unique characteristics of their cities. This study not only sheds light on the complex interplay between property features and market pricing but also highlights the adaptability required in predictive modeling for diverse urban markets.

Future Work

One of the things we looked into was predicting the profitability of Airbnbs using the features that the data from Inside Airbnb and Scrapeak have in common. Our goal was to see if we could predict how much an Airbnb made in a thirty-day period, calculated as the daily cost times the number of days the listing is unavailable. After cleaning the data, the two sources had five features in common: latitude, longitude, number of bathrooms, number of bedrooms, and neighborhood. We used linear regression and neural networks with multiple hyperparameter settings to predict how much a listing on Airbnb makes in a month.
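The revenue target described above can be sketched as a one-line calculation. The function and parameter names are hypothetical, and the sketch assumes that every unavailable day in the window is a booked night at the listed price, which is the same simplification the definition itself makes.

```python
def estimated_monthly_revenue(nightly_price, available_days, window=30):
    """Nightly price times the number of unavailable days in the window."""
    if not 0 <= available_days <= window:
        raise ValueError("available_days must be between 0 and window")
    return nightly_price * (window - available_days)
```

A listing priced at $150 per night that shows 10 available days out of 30 would be assigned an estimated revenue of $3,000 for the period.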

The best model had a loss, mean absolute error, of 1940.11 on the training set, 2073.94 on the validation set, and 1815.11 on the test set. The information could be used to assist with financial decisions but leaves a lot to be desired. Obviously, there is more that goes into having a good Airbnb than the location, number of bedrooms, and number of bathrooms. We would like to investigate this further if we can obtain more data.

One way we could get more data would be to pull listings.csv.gz multiple times over a number of years. A script that checks the date of the data and pulls the file if it has been updated would be an easy way to add more samples to our data set. This course of action was not possible during the project due to our mandated start and deliverable dates.
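Such a script could look roughly like the sketch below. It is only an outline under stated assumptions: the caller supplies the file URL and the scrape date of the latest snapshot (how those are discovered is left open), and the function names and local file naming scheme are hypothetical.

```python
import os
import urllib.request


def snapshot_path(data_dir, scrape_date):
    """Local file name for one quarterly snapshot, keyed by its scrape date."""
    return os.path.join(data_dir, f"listings_{scrape_date}.csv.gz")


def pull_if_new(url, scrape_date, data_dir):
    """Download the file at `url` unless this scrape date is already stored.

    Returns the local path when a new file was saved, otherwise None.
    """
    os.makedirs(data_dir, exist_ok=True)
    target = snapshot_path(data_dir, scrape_date)
    if os.path.exists(target):
        return None  # this snapshot was already pulled on an earlier run
    urllib.request.urlretrieve(url, target)
    return target
```

Run on a schedule, a script like this would accumulate one file per quarterly scrape without downloading the same snapshot twice.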

Another option would be to use paid data sources that offer more features and/or more samples. This was also not possible, given that we had no funding to purchase data. Inside Airbnb offers additional data sets, but they cost $500 per metropolitan area. Without funds for data or additional years to collect it, we decided not to pursue this part of our project further.

Conclusion

In summarizing our analysis, we’ve scrutinized the intricacies of Airbnb pricing through an approach encompassing data collection, data preprocessing, model selection, and city-specific comparisons. From the humble beginnings of Airbnb’s inception to its current stature as a global hospitality powerhouse, our investigation has shed light on the multifaceted determinants shaping pricing dynamics in the short-term rental market.

Our data collection efforts, sourced from Inside Airbnb and Scrapeak, yielded a rich dataset with crucial features influencing Airbnb pricing. Through meticulous cleaning and manipulation, we curated a dataset primed for predictive modeling, laying the foundation for a detailed exploration of pricing trends.

In our model comparison phase, we fit the data to LASSO, Linear Regression, Generalized Additive Model (GAM), Random Forest, and Support Vector Regression (SVR) models. Random Forest emerged as the best choice, with the highest R-squared value and lowest test Mean Squared Error. The Random Forest model offers actionable insights for hosts aiming to optimize their pricing strategies, allowing hosts to understand what attributes of their Airbnb are most important for pricing in their particular city and what the price should be. Overall, we found the number of bathrooms to be the most important factor.

Venturing beyond model evaluation, our cross-city analysis unveiled localized patterns and nuances, highlighting the diverse factors influencing pricing dynamics across different urban landscapes.

While our findings provide valuable guidance for hosts, there is still work to be done. The pursuit of forecasting Airbnb profitability calls for further refinement and exploration, opening avenues for deeper investigation. Despite constraints in data availability and resources, our study sets the stage for future investigations into this domain.

Whether individuals are current hosts aiming to maximize their Airbnb’s potential or prospective homebuyers seeking investment opportunities, our analysis offers valuable insights. Access to information can provide a competitive advantage, and given the stakes and risk involved in buying a house, information can be the differentiating factor between success and crippling failure. In conclusion, our Airbnb pricing analysis underscores the importance of data-driven methodologies in uncovering valuable information on complex topics. This information can empower individuals, whether they are hosts or potential investors, to make informed decisions.

Sources / References

Get the Data. Inside Airbnb. https://insideairbnb.com/get-the-data/
Scrapeak Docs. Scrapeak. https://docs.scrapeak.com/