Exploring Fitbit Data Using FitBit's API

Google Data Analytics Specialization Capstone Project

Bellabeat is a high-tech manufacturer of health-focused products for women. I am working as a junior data analyst on Bellabeat’s marketing analyst team. Bellabeat is a successful small company with the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. I have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. Sršen is confident that an analysis of non-Bellabeat consumer data (i.e., FitBit fitness tracker usage data) would reveal more opportunities for growth. The insights from the data will help guide the company’s marketing strategy. This write-up presents my analysis of the data along with high-level recommendations for Bellabeat’s marketing strategy.

Business Task: Analyze FitBit fitness tracker data to gain insights into how consumers are using the FitBit app and discover trends for Bellabeat marketing strategy.

Ask Phase

First, we need to identify our key stakeholders. In this case, they are:

Urška Sršen: Bellabeat’s co-founder and Chief Creative Officer
Sando Mur: Mathematician and Bellabeat’s co-founder; key member of the Bellabeat executive team
Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.

Business Objectives:

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?

Prepare Phase

Sršen encouraged me to use public data that explores smart device users’ daily habits. She pointed me to a specific data set:

FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle data set contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits. The data is publicly available on Kaggle as "FitBit Fitness Tracker Data" and is stored in 18 CSV files.

In the Prepare phase, we identify the data being used and its limitations:

The data was collected in 2016, 7 years before this analysis. Users’ daily activity, fitness and sleeping habits, and diet and food consumption may have changed since then, so the data may not be timely or relevant.
A sample of 30 FitBit users is not representative of the entire fitness population.
The data was collected through a survey, so we cannot be sure of its integrity or accuracy.

Is the data ROCCC?

A good data source is ROCCC which stands for Reliable, Original, Comprehensive, Current, and Cited.

Reliable — LOW — Not reliable as it only has 30 respondents
Original — LOW — Third party provider (Amazon Mechanical Turk)
Comprehensive — MED — Parameters match most of Bellabeat products’ parameters
Current — LOW — Data is 7 years old and may not be relevant
Cited — LOW — Data collected from third party, hence unknown

Overall, the dataset is considered poor-quality data, and producing business recommendations based on this data alone is not recommended.

I downloaded the data through a secure browser and stored it in a secured folder on my secured hard disk.

Process Phase

In this phase we will process the data by cleaning and ensuring that it is correct, relevant, complete and error free.

Check whether the data contains any missing or null values
Transform the data into the format we want for the analysis
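
As a sketch of the transformation step: the ActivityDate column is read in as text in month/day/year form (for example "4/12/2016"), so it has to be parsed before any date arithmetic or weekday grouping. The small data frame below is a stand-in for the real file, and the column names activity_date and weekday are my own choices for illustration, not part of the data set. It uses base R only.

```r
# Stand-in rows shaped like dailyActivity_merged.csv (illustrative only)
activity <- data.frame(
  Id = c(1503960366, 1503960366, 1503960366),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/14/2016"),
  TotalSteps = c(13162, 10735, 10460),
  stringsAsFactors = FALSE
)

# Parse the month/day/year strings into Date objects
activity$activity_date <- as.Date(activity$ActivityDate, format = "%m/%d/%Y")

# Derive the weekday name (locale-dependent), useful for usage-by-day summaries
activity$weekday <- weekdays(activity$activity_date)

str(activity)
```

Keeping the original text column next to the parsed one makes it easy to spot rows where parsing failed, since those would show up as NA in activity_date.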

Tool:

I have used RStudio for data cleaning, data transformation, data analysis and visualization.

First, we need to install and load the packages required for the analysis. I already have all of the packages installed, so I load them together.

install.packages("skimr")

install.packages("lubridate")

install.packages("sqldf")

install.packages("janitor")

install.packages("plotrix")

install.packages("tidyverse")

library(sqldf) #For using SQL queries

library(skimr) #For summarizing data

library(dplyr) #For data manipulation

library(ggplot2) #For data visualization

## -- Attaching packages ---------------------------------------- tidyverse 1.3.1 ----

## v ggplot2 3.3.5 v purrr 0.3.4

## v tibble 3.1.6 v dplyr 1.0.7

## v tidyr 1.2.0 v stringr 1.4.0

## v readr 2.1.2 v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --

## x dplyr::filter() masks stats::filter()

## x dplyr::lag() masks stats::lag()

## Loading required package: gsubfn

## Loading required package: proto

## Loading required package: RSQLite

## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':

## chisq.test, fisher.test
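
Since the packages were already installed here, the install step could be guarded so that it only runs for packages that are missing. The following is a minimal sketch of that idea in base R; missing_packages is a hypothetical helper name of my own, not a function from any package, and the actual install call is left commented out so the sketch has no side effects.

```r
# Packages this analysis relies on
needed <- c("skimr", "lubridate", "sqldf", "janitor", "plotrix", "tidyverse")

# Return the subset of `pkgs` that is not installed yet.
# `missing_packages` is a hypothetical helper, defined here for illustration.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(needed)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
print(to_install)
```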

We can read the data stored on the secured hard disk with read_csv from the readr package and store each file in a variable of our choice.

library(readr) # Loaded once; provides read_csv

# Daily activity: steps, distance, active minutes and calories
daily_activity <- read_csv("Desktop/fitbit_data_2016/dailyActivity_merged.csv")
View(daily_activity)

# Daily sleep records
daily_sleep <- read_csv("Desktop/fitbit_data_2016/sleepDay_merged.csv")
View(daily_sleep)

# Weight log entries
weight_log <- read_csv("Desktop/fitbit_data_2016/weightLogInfo_merged.csv")
View(weight_log)
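
The three read calls follow the same pattern, so they could also be written as a loop over file names. The sketch below is one way to do that, under some assumptions: it uses base R's read.csv rather than read_csv, and it writes two tiny made-up CSV files to a temporary folder so it runs without the real download. The folder name, the file contents, and the list name tables are all illustrative.

```r
# Create a temporary folder with two tiny made-up CSV files
data_dir <- file.path(tempdir(), "fitbit_demo")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(Id = c(1, 2), TotalSteps = c(100, 200)),
          file.path(data_dir, "dailyActivity_merged.csv"), row.names = FALSE)
write.csv(data.frame(Id = c(1, 2), TotalMinutesAsleep = c(400, 380)),
          file.path(data_dir, "sleepDay_merged.csv"), row.names = FALSE)

# Map the variable names we want to the file names on disk
files <- c(daily_activity = "dailyActivity_merged.csv",
           daily_sleep    = "sleepDay_merged.csv")

# Read every file into a named list of data frames
tables <- lapply(files, function(f) read.csv(file.path(data_dir, f)))

# Show rows and columns per table to confirm each one loaded
print(sapply(tables, dim))
```

A named list keeps related tables together, and individual tables are still reachable as tables$daily_activity and so on.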

Next, we need to check each data set for nulls or missing values, using the following commands.

str(daily_activity)

spec_tbl_df [940 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)

$ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...

$ ActivityDate : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...

$ TotalSteps : num [1:940] 13162 10735 10460 9762 12669 ...

$ TotalDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...

$ TrackerDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...

$ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...

$ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...

$ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...

$ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...

$ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...

$ VeryActiveMinutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...

$ FairlyActiveMinutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...

$ LightlyActiveMinutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...

$ SedentaryMinutes : num [1:940] 728 776 1218 726 773 ...

$ Calories : num [1:940] 1985 1797 1776 1745 1863 ...

skim(daily_activity)

   skim_variable            n_missing complete_rate    mean      sd         p0        p25     p50
 1 Id                               0             1 4.86e+9 2.42e+9 1503960366 2320127002 4.45e+9
 2 TotalSteps                       0             1 7.64e+3 5.09e+3          0      3790. 7.41e+3
 3 TotalDistance                    0             1 5.49e+0 3.92e+0          0       2.62 5.24e+0
 4 TrackerDistance                  0             1 5.48e+0 3.91e+0          0       2.62 5.24e+0
 5 LoggedActivitiesDistance         0             1 1.08e-1 6.20e-1          0          0       0
 6 VeryActiveDistance               0             1 1.50e+0 2.66e+0          0          0 2.10e-1
 7 ModeratelyActiveDistance         0             1 5.68e-1 8.84e-1          0          0 2.40e-1
 8 LightActiveDistance              0             1 3.34e+0 2.04e+0          0       1.95 3.36e+0
 9 SedentaryActiveDistance          0             1 1.61e-3 7.35e-3          0          0       0
10 VeryActiveMinutes                0             1 2.12e+1 3.28e+1          0          0   4 e+0
11 FairlyActiveMinutes              0             1 1.36e+1 2.00e+1          0          0   6 e+0
12 LightlyActiveMinutes             0             1 1.93e+2 1.09e+2          0        127 1.99e+2
13 SedentaryMinutes                 0             1 9.91e+2 3.01e+2          0       730. 1.06e+3
14 Calories                         0             1 2.30e+3 7.18e+2          0      1828. 2.13e+3
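
The n_missing column in the skim output is 0 for every variable, so this table has no missing values. The same check can be done in base R, and it is worth pairing with a count of duplicated rows and of distinct users. The sketch below runs on a small made-up data frame (one missing value and one duplicated row, planted on purpose) rather than the real data; the object names are my own.

```r
# Small made-up data frame with one missing value and one duplicated row
demo <- data.frame(
  Id = c(1, 1, 2, 2, 2),
  TotalSteps = c(5000, 7000, NA, 3000, 3000),
  Calories = c(1800, 2000, 1500, 1600, 1600)
)

# Missing values per column (the base R counterpart of skim's n_missing)
na_per_column <- colSums(is.na(demo))

# Number of rows that exactly repeat an earlier row
n_duplicates <- sum(duplicated(demo))

# Number of distinct users in the table
n_users <- length(unique(demo$Id))

print(na_per_column)
print(n_duplicates)
print(n_users)
```

Applied to the real tables, the same three lines would take daily_activity, daily_sleep, or weight_log in place of demo.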

head(daily_activity)