Identify a marketplace that connects people renting houses with people looking for a place to stay.
This personal project is a case study assigned by RevoU during a two-week mini course to assess participants’ understanding of the Data Analytics material presented. For it, I ran several analyses of Airbnb, a marketplace that connects people renting out their homes with people looking for a place to stay.
Airbnb is an online marketplace that connects people who want to rent out their homes with people looking for accommodations in specific locales. The company has come a long way since 2007, when its co-founders first came up with the idea to invite paying guests to sleep on an air mattress in their living room. According to Airbnb’s latest data, it now has more than 7 million listings, covering some 100,000 cities and towns in 220-plus countries and regions worldwide.
The first step is that we have to make sure our data is connected.
# Connect with dataset in drive
from google.colab import drive
drive.mount("/content/drive")
Import libraries we need to use, like:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
from datetime import datetime, timedelta
import plotly.express as px
import warnings
Read the data
# Read the data
df = pd.read_csv("/content/drive/MyDrive/dataset nyc airbnb/AB_NYC_2019.csv")
This step covers basic data understanding:
In the first step of data processing, we use a few basic inspection commands to get an overview of the dataset, check the data types, and look for duplicate and missing values.
# View the data
df.head()
# View the descriptive statistics or overview of the dataset
df.describe()
# View the info from the data, like data types and others
df.info()
print(df.shape) # Provide the number of rows and columns in the dataset
list(df) # View list of column names
# Check for duplicate values (assign the result, because drop_duplicates returns a new DataFrame)
df = df.drop_duplicates()
print(df.shape)
print(df.isnull().sum()) # Print the number of null values in each column
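As a small illustration of the duplicate and null checks described above, here is a minimal sketch on a made-up table. The `toy` DataFrame and its values are assumptions for illustration only, not rows from the Airbnb dataset.

```python
import pandas as pd

# Toy data standing in for the Airbnb listings (values are made up for illustration)
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "name": ["Cozy room", "Loft", "Loft", None],
    "price": [80, 150, 150, 60],
})

# Count fully duplicated rows before dropping them
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates returns a new DataFrame, so the result has to be assigned
deduped = toy.drop_duplicates()

# Number of null values in each column
null_counts = deduped.isnull().sum()

print(n_duplicates)
print(deduped.shape)
print(null_counts)
```

Counting duplicates with `duplicated().sum()` before dropping them makes it easy to tell whether the dataset had any duplicates at all.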
The second step is fixing errors in the data.
# Change the data type from object to date
df['last_review'] = pd.to_datetime(df['last_review'])
# Extract the year of the last review (0 when there is no review)
df['review_year'] = df['last_review'].apply(lambda last_review:last_review.year)
df['review_year'] = df['review_year'].fillna(0)
df['review_year'] = df.review_year.astype(int)
# Keep listings with availability > 0, plus listings with availability 0 that were last reviewed in 2019
df = pd.concat([df[(df['availability_365']==0) & (df['review_year']==2019)],df[df['availability_365']>0]])
# Replace the missing values with 0
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)
df.info()
# Drop the 'last_review' column
df.drop('last_review', axis=1, inplace=True)
# Replace the missing 'name' and 'host_name' values with placeholder values
df = df.fillna({'name': 'Upper East Side Oasis!'}) # here I fill with 'Upper East Side Oasis!'
df = df.fillna({'host_name': 'john'}) # here I fill with 'john'
df.head()
df.info()
print(df.isnull().sum())
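The cleaning steps above can also be sketched with a single boolean mask instead of `pd.concat`. This is only an illustrative variant on made-up rows (the `toy` DataFrame is an assumption). It keeps the same listings as the concatenation approach, but in their original row order.

```python
import pandas as pd

# Made-up rows that mimic the relevant columns of the dataset
toy = pd.DataFrame({
    "last_review": ["2019-06-01", "2017-03-15", None, "2018-12-30"],
    "availability_365": [0, 0, 120, 45],
    "reviews_per_month": [1.2, 0.1, None, 0.8],
})

# Convert to datetime, then take the year; missing dates become 0
toy["last_review"] = pd.to_datetime(toy["last_review"])
toy["review_year"] = toy["last_review"].dt.year.fillna(0).astype(int)

# Keep listings that are available, or unavailable but reviewed in 2019
keep = (toy["availability_365"] > 0) | (
    (toy["availability_365"] == 0) & (toy["review_year"] == 2019)
)
cleaned = toy[keep].copy()

# Listings without reviews get 0 reviews per month
cleaned["reviews_per_month"] = cleaned["reviews_per_month"].fillna(0)

print(cleaned)
```

The listing with zero availability and a last review in 2017 is filtered out, while the never-reviewed listing with 120 available days is kept.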
Correlation heatmaps are here to show us how closely related variables are.
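plt.figure(figsize=(12,12))
# numeric_only=True limits the correlation to numeric columns (recent pandas versions raise an error on text columns otherwise)
sns.heatmap(df.corr(numeric_only=True), cmap='Blues', annot=True)
plt.show()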
Next, we look at room types by neighbourhood group.
# Create the count plot
plt.figure(figsize=(10, 10))
sns.set(style="whitegrid")
# Create a count plot with 'room_type' on the x-axis and hue='neighbourhood_group'
sns.countplot(data=df, x='room_type', hue='neighbourhood_group', palette='viridis')
plt.title('Count of Room Types by Neighbourhood Group')
# Show the plot
plt.show()
Next, we show the proportion of Airbnb listings across boroughs and room types.
# Create a bar plot of the proportion of listings per borough
(df['neighbourhood_group'].value_counts() / df.shape[0]).plot.bar(cmap='tab10', title='Proportion of Airbnb listings across boroughs')
plt.xlabel('Borough')
plt.ylabel('Proportion')
plt.xticks(rotation=45)
plt.show()
# Create a bar plot of the proportion of listings per room type
(df['room_type'].value_counts() / df.shape[0]).plot.bar(cmap='tab10', title='Proportion of Airbnb listings across room_type')
plt.xlabel('Room Type')
plt.ylabel('Proportion')
plt.xticks(rotation=45)
plt.show()
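The proportions plotted above come from dividing each category count by the total number of rows. As a hedged sketch on made-up values (the `room_types` Series is an assumption), pandas can produce the same numbers directly with `value_counts(normalize=True)`:

```python
import pandas as pd

# Made-up room types for four listings
room_types = pd.Series(
    ["Entire home/apt", "Private room", "Entire home/apt", "Shared room"],
    name="room_type",
)

# Manual proportion: count per category divided by the number of rows
manual = room_types.value_counts() / room_types.shape[0]

# Same result using the built-in normalization
normalized = room_types.value_counts(normalize=True)

print(normalized)
```

Both versions give proportions that sum to 1, so either can feed the bar plot.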
Where is the most expensive area?
# Create the bar plot
fig = sns.catplot(x='neighbourhood_group', y='price', data=df, kind='bar', hue='room_type', palette='viridis')
# Add a title and adjust its position
fig.fig.suptitle('Where is the most expensive area?', fontsize=15, y=1.05)
# Save the plot as an image with tight layout
fig.savefig('most_expensive_area.png', bbox_inches='tight')
Where is the most popular area based on availability?
# Set up the figure
plt.figure(figsize=(10, 6))
sns.set(style="whitegrid")
# Sort the DataFrame by 'availability_365' in descending order
df_sorted = df.sort_values(by='availability_365', ascending=False)
# Create the box plot
sns.boxplot(data=df_sorted, x='neighbourhood_group', y='availability_365', palette='viridis')
plt.xlabel('Neighbourhood Group')
plt.ylabel('Availability (in days)')
plt.title('Most Popular Neighbourhood Group Based on Availability')
# Show the plot
plt.show()
Next, we can plot the listings on a map-style scatter plot to see their geographic distribution.
# Map-style scatter plot of listings by longitude and latitude
plt.figure(figsize=(10,10))
sns.scatterplot(x='longitude', y='latitude', hue='neighbourhood_group',s=20, data=df, palette="viridis")
To see the top 5 hosts by calculated_host_listings_count:
# Calculate total listings count per host name
total_host_listings_count = df.groupby('host_name')['calculated_host_listings_count'].sum()
# Rank hosts by total listings count
ranked_hosts = total_host_listings_count.sort_values(ascending=False)
# Top 5 host_names based on total listings count
top_5_host_names = ranked_hosts.head(5)
# Display the top 5 host_names
print(top_5_host_names)
# Create a bar plot to visualize the top 5 host_names by total listings
plt.figure(figsize=(10, 6))
plt.bar(top_5_host_names.index, top_5_host_names.values, color='skyblue')
plt.xlabel('Host Name')
plt.ylabel('Total Listings')
plt.title('Top 5 Hosts by Total Listings')
plt.xticks(rotation=45)
plt.tight_layout()
# Show the plot
plt.show()
A correlation heatmap is a graphical representation of a correlation matrix, which shows the correlation between each pair of variables. A correlation value can range from -1 to 1. Correlation between two random variables or bivariate data does not necessarily imply a causal relationship. The result is shown in Fig. 2 below 👇
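To make the -1 to 1 range concrete, here is a minimal sketch of a correlation matrix on made-up numbers. The `toy` DataFrame is an assumption built so that one pair of columns is perfectly positively correlated and another perfectly negatively correlated. The `numeric_only=True` argument assumes pandas 1.5 or newer.

```python
import pandas as pd

toy = pd.DataFrame({
    "price": [50, 100, 150, 200],
    "minimum_nights": [1, 2, 3, 4],           # rises with price in this toy data
    "availability_365": [300, 200, 100, 0],   # falls as price rises
    "room_type": ["Private room", "Entire home/apt", "Entire home/apt", "Shared room"],
})

# Pearson correlation between numeric columns only; text columns are skipped
corr = toy.corr(numeric_only=True)

print(corr)
```

Here `price` and `minimum_nights` correlate at 1 and `price` and `availability_365` at -1. Real data will usually fall somewhere in between.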
For this analysis, we first took the Airbnb listings data and grouped the listings by neighbourhood group: Manhattan, Bronx, Brooklyn, Queens, and Staten Island. We then counted the listings of the three main room types (Entire home/apt, Private room, and Shared room) in each neighbourhood group. The details are shown in Fig. 3.
The results indicate that Manhattan has the highest number of Entire home/apt listings, reflecting its popularity as a tourist destination. Brooklyn follows closely behind, offering a mix of room types. In contrast, the Bronx, Queens, and Staten Island have a higher proportion of Entire home/apt and Private rooms, likely due to their residential nature.
This is a breakdown of the count of room types by neighbourhood group.
In this analysis, we explore the distribution of Airbnb listings across the five major boroughs of New York City: Manhattan, Bronx, Brooklyn, Queens, and Staten Island. We can see details in Fig. 4.
In Fig. 4, Manhattan and Brooklyn have the highest proportions of Airbnb listings, reflecting their status as top tourist destinations. Queens also boasts a significant number of listings, while the Bronx offers a more affordable alternative. Staten Island, with its suburban appeal, has the smallest share of Airbnb listings among the five boroughs.
In addition to boroughs, we explored the distribution of Airbnb listings across the three room types in New York City: Entire home/apt, Private room, and Shared room. The details are shown in Fig. 5.
In terms of room types, Airbnb listings in New York City predominantly comprise “Entire home/apt” options, catering to those seeking privacy and convenience. “Private room” listings come next, offering a balance between affordability and comfort. “Shared room” listings are the least common, serving budget-conscious solo travelers or those comfortable sharing living spaces.
To analyze the data, we first gathered information on rental prices for these room types in each borough. Afterward, we calculated the average rental price for each combination of borough and room type. Here are the findings:
Manhattan: Unsurprisingly, Manhattan emerges as the most expensive borough overall, with entire homes/apartments being the costliest, followed by private rooms and shared rooms.
Bronx: In the Bronx, shared rooms are the most affordable option, followed by private rooms and entire homes/apartments.
Brooklyn: Brooklyn exhibits a similar pattern to Manhattan, with entire homes/apartments being the most expensive, followed by private rooms and shared rooms.
Queens: Queens generally offers more affordable accommodations compared to Manhattan and Brooklyn. Entire homes/apartments are the priciest, followed by private rooms and shared rooms.
Staten Island: Staten Island, being the least expensive of the five boroughs, sees private rooms as the most economical choice, followed by shared rooms and entire homes/apartments.
To visualize these findings, we’ve created bar graphs below that represent the average rental prices for each borough and room type. These graphs will help you better understand the price distribution across the different areas and accommodation types. We can see details in Fig. 6.
Please note that the specific rental prices can vary greatly within each borough, and this analysis provides a general overview. If you have access to the relevant data, you can create more detailed analyses and visuals to dive deeper into the specific neighborhoods and factors influencing rental prices in each area.
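The average-price calculation described above can be sketched with a `groupby` on borough and room type. The rows below are made up for illustration (the `toy` DataFrame is an assumption). The mean is also what seaborn's bar plots show by default.

```python
import pandas as pd

# Made-up listings; prices are not taken from the real dataset
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan", "Bronx", "Bronx"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Private room", "Shared room"],
    "price": [200, 300, 100, 60, 40],
})

# Mean price for each (borough, room type) pair, with room types as columns
avg_price = (
    toy.groupby(["neighbourhood_group", "room_type"])["price"]
    .mean()
    .unstack()
)

print(avg_price)
```

Combinations that do not occur in the data, such as an entire home in the Bronx in this toy table, show up as NaN rather than 0.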
The box plot analysis depicts the availability of Airbnb listings in the five major boroughs of New York City (Manhattan, Bronx, Brooklyn, Queens, and Staten Island). The ‘availability_365’ metric is used to measure the availability of listings throughout the year. We can see the details in Fig. 7.
From the box plot above, we can infer that listings in Staten Island tend to be more available throughout the year, with some available for more than 300 days. On average, these listings are available for more than 250 days a year, followed by the Bronx, where listings are available for more than 150 days a year on average.
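The central line of a box plot marks the median rather than the mean, so it can help to compute both per borough when reading the figure. A minimal sketch on made-up availability values (the `toy` DataFrame is an assumption):

```python
import pandas as pd

# Made-up availability values for two boroughs
toy = pd.DataFrame({
    "neighbourhood_group": ["Staten Island"] * 3 + ["Manhattan"] * 3,
    "availability_365": [200, 250, 330, 0, 30, 90],
})

# Mean and median availability per borough
summary = toy.groupby("neighbourhood_group")["availability_365"].agg(["mean", "median"])

print(summary)
```

When the mean and the median differ noticeably, the distribution is skewed, which is common for availability data with many zero-day listings.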
The map-style scatter plot shows how listings are distributed across the five boroughs of New York City: Manhattan, Bronx, Brooklyn, Queens, and Staten Island. Each point is one listing placed at its longitude and latitude and coloured by neighbourhood group, so the plot gives a visual sense of where listings are concentrated.
Finally, we find the top 5 hosts by total listings count.
The top 5 host names are Sonder (NYC), Blueground, Kara, Kazuya, and Sonder.
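One thing to keep in mind when reading this ranking: if `calculated_host_listings_count` already holds each host's total on every row (an assumption based on the column name), summing it per `host_name` adds that total once per listing, and hosts who share a name are merged. The sketch below uses made-up rows to compare that sum with a plain row count per `host_id`. It is an alternative view, not the method used for the results above.

```python
import pandas as pd

# Made-up listings: host 1 has three listings, hosts 2 and 3 have one each
toy = pd.DataFrame({
    "host_id": [1, 1, 1, 2, 3],
    "host_name": ["Alex", "Alex", "Alex", "Sam", "Alex"],  # two different hosts named Alex
    "calculated_host_listings_count": [3, 3, 3, 1, 1],
})

# Approach used in the analysis: sum the per-row counts by host name
summed_by_name = toy.groupby("host_name")["calculated_host_listings_count"].sum()

# Alternative: count listing rows per individual host
counted_by_id = (
    toy.groupby(["host_id", "host_name"]).size().sort_values(ascending=False)
)

print(summed_by_name)
print(counted_by_id)
```

In this toy table the summed value for "Alex" is 10, while the largest individual host actually has 3 listings.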
We can infer that Manhattan has the widest range of prices and is the costliest place to stay in NYC, followed by Brooklyn and Queens.
Understanding these proportions can assist travelers in selecting accommodations that suit their preferences and budgets.
Manhattan has the highest number of Entire home/apt listings, reflecting its popularity as a tourist destination. Brooklyn follows closely behind, offering a mix of room types. Manhattan also has the most expensive places to stay in NYC. Room availability is very low in Manhattan and Brooklyn, while you can find a room at almost any time in Staten Island and the Bronx.
If you are looking for the most expensive locations, you can go to Manhattan, Brooklyn, or Queens. However, if you want the places with the highest availability in New York City, you can choose the Bronx, with an average of more than 150 available days a year, or Staten Island, with an average of more than 250 available days a year.
If you see mistakes or want to suggest changes, please create an issue on the source repository.
For attribution, please cite this work as
Santoso (2023, Sept. 29). Agus Santoso: NYC AirBnB Data Analytics Case Study. Retrieved from https://agussantoso05.github.io/index.html/posts/2023-09-29-nycairbnb-data-analytics-case-study/
BibTeX citation
@misc{santoso2023nyc,
  author = {Santoso, Agus},
  title = {Agus Santoso: NYC AirBnB Data Analytics Case Study},
  url = {https://agussantoso05.github.io/index.html/posts/2023-09-29-nycairbnb-data-analytics-case-study/},
  year = {2023}
}