This case study analyzes smart device data to gain insight into how consumers are using their smart devices.
Welcome to the Bellabeat data analysis case study! In this case study, I will perform many real-world tasks of a junior data analyst. I will imagine I am working for Bellabeat, a high-tech manufacturer of health-focused products for women, and meet different characters and team members. In order to answer the key business questions, I will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act.
I am a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but it has the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. I have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights I discover will then help guide marketing strategy for the company. I will present my analysis to the Bellabeat executive team along with my high-level recommendations for Bellabeat’s marketing strategy.
The first step is to make sure our data is connected.
# Connect Google Colab with my Drive
from google.colab import drive
drive.mount("/content/drive")
Import the libraries we need:
# Import all the libraries we need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
import plotly.graph_objects as go
from plotly.offline import iplot
from plotnine.data import economics
from plotnine import ggplot, aes, geom_line
import warnings
Read the data
# Import the data we use
daily_activity = pd.read_csv("/content/drive/MyDrive/Fitabase Dataset/dailyActivity_merged.csv")
daily_calories = pd.read_csv("/content/drive/MyDrive/Fitabase Dataset/dailyCalories_merged.csv")
sleep_day = pd.read_csv("/content/drive/MyDrive/Fitabase Dataset/sleepDay_merged.csv")
weight_log_info = pd.read_csv("/content/drive/MyDrive/Fitabase Dataset/weightLogInfo_merged.csv")
This is the basic data understanding step:
In the first step of data processing, we can use several basic inspections to get an overview of the dataset, see the data type information, and check for duplicate values.
Break down the data to make analysis easier
# View the data
daily_activity.head()
daily_activity.describe()  # View the descriptive statistics or overview of the dataset
daily_activity.info()  # View the info from the data, like data types and others
print(daily_activity.shape)  # Provide the number of rows and columns in the dataset
list(daily_activity)  # View list of column names
# Check for duplicate values and drop them (the result must be assigned back to take effect)
daily_activity = daily_activity.drop_duplicates()
daily_activity.shape
print(daily_activity.isnull().sum())  # Print the number of null values in each column
The second step is fixing errors in the data
# Change the data type from object to datetime
daily_activity['ActivityDate'] = pd.to_datetime(daily_activity['ActivityDate'])
# Create year column
daily_activity['Year'] = daily_activity['ActivityDate'].dt.year
# Create month column
daily_activity['Month'] = daily_activity['ActivityDate'].dt.month
# Create day column
daily_activity['Day'] = daily_activity['ActivityDate'].dt.day
# Create day_of_week column
daily_activity['Day_of_week'] = daily_activity['ActivityDate'].dt.day_name()
daily_activity.info()
daily_activity.head()
Check the result after changing the data type and adding the new columns with daily_activity.info() and daily_activity.head(). The details are shown in the image below.
Note: Do the same for daily_calories, sleep_day, and weight_log_info, adjusting the column names to each dataset. Once all the required datasets have been prepared and processed, you can proceed to the next step.
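To avoid repeating the same date steps for every table, they could be wrapped in a small helper. The sketch below is only an illustration: the function name add_date_parts is hypothetical, and the toy frame and its "SleepDay" column name are assumptions standing in for the real sleep_day data.

```python
import pandas as pd


def add_date_parts(df, date_col):
    """Return a copy of df with Year, Month, Day and Day_of_week columns added."""
    out = df.copy()
    out[date_col] = pd.to_datetime(out[date_col])
    out["Year"] = out[date_col].dt.year
    out["Month"] = out[date_col].dt.month
    out["Day"] = out[date_col].dt.day
    out["Day_of_week"] = out[date_col].dt.day_name()
    return out


# Toy frame standing in for sleep_day (column name "SleepDay" is an assumption)
toy_sleep = pd.DataFrame({
    "SleepDay": ["4/12/2016", "4/13/2016"],
    "TotalMinutesAsleep": [327, 384],
})
toy_sleep = add_date_parts(toy_sleep, "SleepDay")
print(toy_sleep[["SleepDay", "Day_of_week"]])
```

The same call would then be repeated for each table, passing whichever column holds the date in that table.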
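plt.figure(figsize=(14,14))
# numeric_only=True skips text columns such as Day_of_week, which cannot be correlated
sns.heatmap(daily_activity.corr(numeric_only=True), cmap='Greens', annot=True)
plt.show()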
# List of activity types to calculate the mean for
activity_types = ['VeryActiveMinutes', 'FairlyActiveMinutes', 'LightlyActiveMinutes', 'SedentaryMinutes']
# Calculate the average values for each activity type across all days
average_activity = daily_activity[activity_types].mean()
# Create labels and values for the pie chart
activity_labels = average_activity.index
activity_minutes = average_activity.values
# Create the pie chart
plt.figure(figsize=(8, 8))
plt.pie(activity_minutes, labels=activity_labels, autopct='%1.1f%%', startangle=140)
plt.title('Average Activity Minutes')
plt.show()
# Create a specific order for days of the week
days_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# Group and sum the Very Active Minutes for each day
veryactive_minutes = daily_activity.groupby('Day_of_week').agg({'VeryActiveMinutes': 'sum'}).reindex(days_order).reset_index()
# Set the figure size
a1 = (8, 6)
fig, ax = plt.subplots(figsize=a1)
# Create the bar plot
plot = sns.barplot(x="Day_of_week", y="VeryActiveMinutes", data=veryactive_minutes, palette="deep")
# Rotate x-axis labels for better readability
plot.set_xticklabels(ax.get_xticklabels(), rotation=90)
# Set the title and labels
plot.set_title('Total Very Active Minutes by Days')
ax.set_ylabel('Total VeryActiveMinutes Sum')
ax.set_xlabel('Days')
# Create a specific order for days of the week
days_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# Group and sum the SedentaryMinutes for each day
sedentary_minutes = daily_activity.groupby('Day_of_week').agg({'SedentaryMinutes': 'sum'}).reindex(days_order).reset_index()
# Set the figure size
a1 = (8, 6)
fig, ax = plt.subplots(figsize=a1)
# Create the bar plot
plot = sns.barplot(x="Day_of_week", y="SedentaryMinutes", data=sedentary_minutes, palette="deep")
# Rotate x-axis labels for better readability
plot.set_xticklabels(ax.get_xticklabels(), rotation=90)
# Set the title and labels
plot.set_title('Total Sedentary Minutes by Days')
ax.set_ylabel('Total SedentaryMinutes Sum')
ax.set_xlabel('Days')
# Create a specific order for days of the week
days_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# Group and sum the TotalSteps for each day
totalsteps_days = daily_activity.groupby('Day_of_week').agg({'TotalSteps': 'sum'}).reindex(days_order).reset_index()
# Set the figure size
a1 = (8, 6)
fig, ax = plt.subplots(figsize=a1)
# Create the bar plot
plot = sns.barplot(x="Day_of_week", y="TotalSteps", data=totalsteps_days, palette="deep")
# Rotate x-axis labels for better readability
plot.set_xticklabels(ax.get_xticklabels(), rotation=90)
# Set the title and labels
plot.set_title('Total Steps by Days')
ax.set_ylabel('Total Steps Sum')
ax.set_xlabel('Days')
# Create a specific order for days of the week
days_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# Group and sum the TotalDistance for each day
totaldistance_days = daily_activity.groupby('Day_of_week').agg({'TotalDistance': 'sum'}).reindex(days_order).reset_index()
# Set the figure size
a1 = (8, 6)
fig, ax = plt.subplots(figsize=a1)
# Create the bar plot
plot = sns.barplot(x="Day_of_week", y="TotalDistance", data=totaldistance_days, palette="deep")
# Rotate x-axis labels for better readability
plot.set_xticklabels(ax.get_xticklabels(), rotation=90)
# Set the title and labels
plot.set_title('Total Distance by Days')
ax.set_ylabel('Total Distance Sum')
ax.set_xlabel('Days')
# Create a figure and the primary y-axis
fig, ax1 = plt.subplots(figsize=(10, 6))
# Note: with raw rows as input, sns.barplot draws the mean per day by default, not the sum
sns.barplot(x='Day_of_week', y='TotalSteps', data=daily_activity, label='Total Steps', color='skyblue', ax=ax1)
# Calculate a scaling factor for Total Distance
total_distance_max = daily_activity['TotalDistance'].max()
scaling_factor = daily_activity['TotalSteps'].max() / total_distance_max
# Plot Total Distance (scaled to the range of Total Steps) on the same axis
daily_activity['TotalDistance_scaled'] = daily_activity['TotalDistance'] * scaling_factor
sns.barplot(x='Day_of_week', y='TotalDistance_scaled', data=daily_activity, label='Total Distance', color='salmon', ax=ax1)
# Customize the plot
plt.title('Total Steps and Total Distance by Day of Week')
ax1.set_xlabel('Day of Week')
ax1.set_ylabel('Count')
plt.grid(True)
# Show the plot
plt.tight_layout()
plt.show()
# Create a scatter plot using Seaborn
plt.figure(figsize=(10, 6))
sns.scatterplot(data=daily_activity, x="TotalSteps", y="Calories", hue="SedentaryMinutes", palette="coolwarm", alpha=0.7)
plt.xlabel("Total Steps")
plt.ylabel("Calories")
plt.title("Relationship between Total Steps and Calories")
plt.grid(True)
# Add a regression line (similar to geom_smooth in ggplot2)
sns.regplot(data=daily_activity, x="TotalSteps", y="Calories", scatter=False, color="gray")
plt.show()
# Group and sum the TotalMinutesAsleep for each day, sorted from highest to lowest
sd_days = sleep_day.groupby('Day_of_week').agg({'TotalMinutesAsleep': 'sum'}).reset_index().sort_values('TotalMinutesAsleep', ascending=False)
a1 = (8, 6)
fig, ax = plt.subplots(figsize=a1)
plot = sns.barplot(x="Day_of_week", y="TotalMinutesAsleep", data=sd_days, palette="deep")
plot.set_xticklabels(ax.get_xticklabels(), rotation=90)
plot.set_title('Total Minutes Asleep During Each Day')
ax.set_ylabel('Total Minutes Asleep')
ax.set_xlabel('Days')
# Define BMI categories based on thresholds
# (contiguous ranges, so values such as 24.95 or 29.95 do not fall through to 'Obese')
def categorize_bmi(bmi):
    if bmi < 18.5:
        return 'Underweight'
    elif bmi < 25:
        return 'Healthy Weight'
    elif bmi < 30:
        return 'Overweight'
    else:
        return 'Obese'
# Apply the categorization to the BMI column
weight_log_info['BMI Category'] = weight_log_info['BMI'].apply(categorize_bmi)
# Calculate the percentage of individuals in each BMI category
category_counts = weight_log_info['BMI Category'].value_counts(normalize=True) * 100
# Create a pie chart
plt.figure(figsize=(8, 8))
plt.pie(category_counts, labels=category_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Percentage of People in Each BMI Category')
plt.axis('equal')  # Equal aspect ratio ensures the pie chart is circular.
# Show the chart
plt.show()
# Histogram for WeightKg
plt.figure(figsize=(10, 6))
sns.histplot(data=weight_log_info, x='WeightKg', bins=20, kde=True)
plt.title('Weight Distribution (Kg)')
plt.xlabel('Weight (Kg)')
plt.ylabel('Frequency')
plt.show()
A correlation heatmap is a graphical representation of a correlation matrix, showing the correlation between each pair of variables. A correlation coefficient can take any value from -1 to 1. Correlation between two variables does not necessarily imply a causal relationship. The heatmap is shown in Fig. 3 below.
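As a minimal sketch of what the heatmap is built from, the example below computes a correlation matrix on a toy table. The numbers are made up for illustration and are not taken from the Bellabeat data; numeric_only=True tells pandas to ignore the text column.

```python
import pandas as pd

# Toy data for illustration only (not from the Fitabase files)
toy = pd.DataFrame({
    "TotalSteps": [1000, 5000, 8000, 12000],
    "Calories": [1600, 1900, 2200, 2700],
    "Day_of_week": ["Monday", "Tuesday", "Wednesday", "Thursday"],
})

# Pearson correlation between every pair of numeric columns
corr = toy.corr(numeric_only=True)
print(corr.round(2))
```

Each cell of the resulting matrix is what the heatmap colors and annotates; the diagonal is always 1 because every column is perfectly correlated with itself.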
The pie chart displays the distribution of active minutes across four categories: Very Active Minutes, Fairly Active Minutes, Lightly Active Minutes, and Sedentary Minutes. See Fig. 4 below for details.
The data in the pie chart of average activity minutes paints a striking picture of user habits. It is immediately apparent that the vast majority of tracked time, approximately 81.3% of the average daily activity minutes, is sedentary. This is a significant concern, as prolonged periods of inactivity can have adverse effects on overall health. Conversely, the chart also reveals a rather alarming statistic: very active minutes make up a mere 1.7% of the average tracked time.
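The percentages in the pie chart are shares of the average minutes per category, not shares of users. A small sketch of that arithmetic, using hypothetical averages chosen only for illustration:

```python
import pandas as pd

# Hypothetical average minutes per day, for illustration only
average_activity = pd.Series({
    "VeryActiveMinutes": 20.0,
    "FairlyActiveMinutes": 15.0,
    "LightlyActiveMinutes": 190.0,
    "SedentaryMinutes": 975.0,
})

# Each category's share of the total tracked minutes, in percent
share = average_activity / average_activity.sum() * 100
print(share.round(1))
```

This is the same calculation plt.pie performs internally when autopct is set, which is why the slices always add up to 100%.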
In our analysis of the Bellabeat dataset, we examined the distribution of ‘Very Active Minutes’ across different days of the week.
This analysis provides valuable insights into user behavior and engagement patterns. It is evident that users tend to accumulate more ‘Very Active Minutes’ during weekdays, with Tuesday standing out as the day when users are most active. This may suggest that users prioritize physical activity during the early part of the week. Interestingly, the activity level starts to decline as we progress towards the weekend, with Friday and Sunday showing lower ‘Very Active Minutes’, while Saturday sees a slight increase.
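One caveat when reading per-day totals: a sum grows with the number of records logged on that weekday, so it can be worth checking the mean and the record count as well. The sketch below uses a made-up toy table to show how the ranking can differ between the two.

```python
import pandas as pd

days_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday"]

# Toy records for illustration only: Tuesday has more rows than Saturday
toy = pd.DataFrame({
    "Day_of_week": ["Tuesday", "Tuesday", "Tuesday", "Saturday", "Saturday", "Monday"],
    "VeryActiveMinutes": [20, 25, 30, 40, 30, 10],
})

# Sum, mean and number of records per weekday, in calendar order
by_day = (
    toy.groupby("Day_of_week")["VeryActiveMinutes"]
    .agg(["sum", "mean", "count"])
    .reindex(days_order)
)
print(by_day)
```

In this toy table Tuesday has the largest sum only because it has more records, while Saturday has the higher mean. Days with no records show up as NaN after reindex.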
The bar chart displays the distribution of Sedentary Minutes. See Fig. 6 below for details.
Our findings reveal an interesting trend where users tend to have higher ‘Sedentary Minutes’ during weekdays, with Tuesday showing the highest level of sedentary behavior. This pattern may be indicative of typical workweek routines, where individuals spend prolonged periods sitting at desks or engaging in less active tasks. As the week progresses, sedentary behavior declines until the end of the week.
The bar graph shows that total steps were highest on Tuesday and decreased towards the weekend, with a jump on Saturday followed by another decrease on Sunday. See Fig. 7 below for details.
The bar graph shows that Total Distance is highest on Tuesday. See Fig. 8 below for details.
The bar graph shows Total Distance and Total Steps for each day of the week.
In our exploration of the Bellabeat dataset, we uncovered a fascinating relationship between ‘Total Steps’ and ‘Calories Burned.’ While conventional wisdom suggests that more steps lead to more calories burned, our analysis revealed a counterintuitive trend. We observed that certain users, classified as sedentary due to their minimal step count, were still able to burn a significant number of calories, often falling within the range of 1500 to 2500 calories. In contrast, some more active users, who took significantly more steps, burned calories in a similar range. See Fig. 10 below for details.
One possible explanation is that factors beyond step count, such as metabolism, basal metabolic rate, or the intensity of activities, play a substantial role in calorie expenditure.
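One way to put numbers on this relationship is a correlation coefficient and a straight-line fit, where the intercept estimates the calories burned at zero steps. The sketch below uses made-up values for illustration only; it is not fitted to the Bellabeat data.

```python
import numpy as np

# Toy observations for illustration only
steps = np.array([0, 2000, 4000, 8000, 12000], dtype=float)
calories = np.array([1800, 1900, 2150, 2350, 2750], dtype=float)

# Pearson correlation between steps and calories
r = np.corrcoef(steps, calories)[0, 1]

# Least-squares line: calories ~ slope * steps + intercept
slope, intercept = np.polyfit(steps, calories, 1)

print(f"correlation: {r:.3f}")
print(f"calories per extra step: {slope:.3f}")
print(f"estimated calories at zero steps: {intercept:.0f}")
```

A large intercept relative to the slope would be consistent with the idea that a baseline, independent of steps, accounts for much of the daily calorie total.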
The bar graph shows that total minutes asleep is highest on Wednesday. See Fig. 11 below for details.
Within the weight log dataset, the ‘Fat’ column consists almost entirely of missing values (NaN). An error is detected in the Fat section, so I am going to fill the NA values based on the available ones.
Note: only two ‘Fat’ values are recorded (22 and 25), so the filled values are random integers in that range rather than measurements.
weight_log_info['Fat'].value_counts()
22.0    1
25.0    1
Name: Fat, dtype: int64
# Replace every 'Fat' value, including the two recorded ones, with a random integer between 22 and 25
weight_log_info['Fat'] = weight_log_info['Fat'].apply(lambda x: random.randint(22, 25))
weight_log_info['Fat'].value_counts()
output
23    21
22    16
24    16
25    14
Name: Fat, dtype: int64
In the output, the value 23 appears in 21 rows, 22 in 16 rows, 24 in 16 rows, and 25 in 14 rows. See Fig. 12 below for details.
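Note that calling apply with a random value on the whole column also overwrites the values that were actually recorded. A possible alternative, sketched below on a toy column, fills only the missing entries. The seed and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator so the illustration is repeatable (the seed is arbitrary)
rng = np.random.default_rng(seed=0)

# Toy column standing in for weight_log_info['Fat']
toy_weight = pd.DataFrame({"Fat": [22.0, np.nan, np.nan, 25.0, np.nan]})

# Boolean mask of the rows where Fat is missing
missing = toy_weight["Fat"].isna()

# Fill only the missing rows with random integers from 22 to 25 inclusive
toy_weight.loc[missing, "Fat"] = rng.integers(22, 26, size=int(missing.sum()))
print(toy_weight)
```

The recorded values in the first and fourth rows stay untouched. Either way, the filled values are synthetic, so any conclusion drawn from this column should be treated with caution.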
Within our dataset, a noteworthy observation emerges: a majority of individuals, comprising 50.7%, are classified as being within the healthy weight range. This percentage is slightly higher than the 47.8% of individuals categorized as overweight and far higher than the 1.5% who fall into the obese category. See Fig. 13 below for details.
This data underscores a compelling opportunity for health promotion initiatives aimed at encouraging and facilitating individuals to achieve and maintain a healthy weight. Such initiatives can play a pivotal role in enhancing overall public health by reducing the prevalence of overweight and obesity, both of which are associated with various health risks and conditions.
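The BMI grouping behind these percentages can also be expressed with pandas.cut, using half-open intervals so that boundary values such as 24.95 or 29.95 land in exactly one category. This is a sketch on made-up BMI values, not the author's original function.

```python
import pandas as pd

# Toy BMI values for illustration only
bmi = pd.Series([17.9, 21.5, 24.95, 27.3, 29.95, 31.0])

# Intervals are [0, 18.5), [18.5, 25), [25, 30), [30, inf) because right=False
bins = [0, 18.5, 25, 30, float("inf")]
labels = ["Underweight", "Healthy Weight", "Overweight", "Obese"]
category = pd.cut(bmi, bins=bins, labels=labels, right=False)

# Percentage of rows in each category
share = category.value_counts(normalize=True) * 100
print(share.round(1))
```

Because the bins share their edges, there is no gap between categories for a value to slip through.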
In our analysis of the Bellabeat dataset, we generated a histogram to gain insights into the distribution of weights among users, as represented by the ‘WeightKg’ variable. The histogram provides a visual representation of the frequency or count of individuals falling within different weight ranges. See Fig. 14 below for details.
Upon examination, we observed a bell-shaped distribution, suggesting that the majority of users have weights clustered around a central value. This central tendency in weight is a common characteristic of populations, with most individuals maintaining weights close to the average.
Furthermore, the histogram revealed that the dataset includes a range of weights, from the lower end to the higher end of the scale. This diversity in weight distribution is essential for tailoring health and fitness recommendations to address the unique needs of individuals with varying weight profiles.
Overall, the ‘WeightKg’ histogram serves as a valuable tool for understanding the weight distribution within the dataset and can inform strategies for promoting and supporting healthy weight management among users.
This analysis provides valuable insights into user behavior and engagement patterns.
For ‘Very Active Minutes’, it is clear that users tend to accumulate more during weekdays, with Tuesday standing out as the day when users are most active. This may suggest that users prioritize physical activity during the early part of the week.
In our exploration, we uncovered a fascinating relationship between ‘Total Steps’ and ‘Calories Burned.’ While conventional wisdom suggests that more steps lead to more calories burned, our analysis revealed a counterintuitive trend. We observed that certain users, classified as sedentary due to their minimal step count, were still able to burn a significant number of calories, often falling within the range of 1500 to 2500 calories.
Within our dataset, a noteworthy observation emerges about the percentage of people in each BMI category: a majority of individuals, comprising 50.7%, are classified as being within the healthy weight range. This percentage is slightly higher than the 47.8% of individuals categorized as overweight and far higher than the 1.5% who fall into the obese category.
The Bellabeat dataset is a comprehensive health and fitness dataset that includes fitness tracking, sleep patterns, total steps, calories, and even weight. This makes it a valuable resource for individuals who want to make informed decisions about their health.
Bellabeat has the capability to notify users about their inactive lifestyle through either the mobile app or directly on the fitness tracker. Given that sedentary time accounts for a significant 81.3% of the average tracked activity minutes, this data could prove highly valuable for devising effective marketing strategies.
If you see mistakes or want to suggest changes, please create an issue on the source repository.
For attribution, please cite this work as
Santoso (2023, Oct. 3). Agus Santoso: Utilizing Health Tech Device Usage Trends to Inform Marketing Strategy: Bellabeat Analysis. Retrieved from https://agussantoso05.github.io/index.html/posts/2023-10-03-utilizing-health-tech-device-usage-trends-to-inform-marketing-strategy-bellabeat-analysis/
BibTeX citation
@misc{santoso2023utilizing, author = {Santoso, Agus}, title = {Agus Santoso: Utilizing Health Tech Device Usage Trends to Inform Marketing Strategy: Bellabeat Analysis}, url = {https://agussantoso05.github.io/index.html/posts/2023-10-03-utilizing-health-tech-device-usage-trends-to-inform-marketing-strategy-bellabeat-analysis/}, year = {2023} }