Exploring Bike Sharing Data in DC, Part 2

In this post we will parse the latitude and longitude information of the Capital Bikeshare stations so we can analyze the distance traveled in every trip (in another post we will explore a more realistic way of calculating distance traveled). The location information is provided in real time in XML format. To parse it we will need the urllib2 and xml libraries. The first step is saving the address as a string; then we 'open' the website and parse its contents.

import pandas as pd
import plotly.plotly as py
import random
import matplotlib.pyplot as plt
import plotly.tools as tls
import xml.etree.ElementTree as ET
import urllib2
from math import radians, cos, sin, asin, sqrt
py.sign_in('yourusername', 'yourkey')

#we save the address of the real-time station feed as a string
#(URL assumed here; check the Capital Bikeshare site for the current feed location)
url = 'http://www.capitalbikeshare.com/data/stations/bikeStations.xml'
#we open the website and parse the data using urllib2 and xml
htm = urllib2.urlopen(url)
doc = ET.parse(htm)

To extract the information from the XML page we retrieve the root of the tree and find the tag, in this case 'station', that contains the fields we are interested in extracting. There are several variables in the XML document, but we are interested in four: terminalName, name, lat, and long. We append these elements to empty lists, use the zip function to create tuples, and then pass them to a data frame.

#we get the root tag
root = doc.getroot()
#we define empty lists to get the terminal id, name, latitude and longitude
terminal_name = []
name = []
lat = []
lon = []
#we now use a for loop to extract the information we are interested in
for station in root.findall('station'):
    terminal_name.append(int(station.find('terminalName').text))
    name.append(station.find('name').text)
    lat.append(float(station.find('lat').text))
    lon.append(float(station.find('long').text))

#we use zip to create tuples and then parse them to a dataframe
prov = zip(terminal_name, lat, lon)
bike_st = pd.DataFrame(data=prov, columns=['terminalName', 'Lat', 'Lon'])

Before we merge this data with the bike trip information we gathered in our previous tutorial, we rename the columns so we have a common variable to merge on and so we can distinguish between the origin and the destination of the trip. We repeat this process twice so we have latitude and longitude coordinates for both the origin and the destination stations in our dataset.

#we now open the file we saved in our previous post (file name assumed)
bikes_2012 = pd.read_csv('bikes_2012.csv')
#we rename the station columns to merge on the origin terminal
bike_st = bike_st.rename(columns={'terminalName': 'start_terminal', 'Lat': 'Lat_or', 'Lon': 'Lon_or'})
#we merge on the origin station
bikes_2012=bikes_2012.merge(bike_st, how="left")

#we rename the terminal name variable to merge with the end destination
bike_st = bike_st.rename(columns={'start_terminal': 'end_terminal', 'Lat_or': 'Lat_end', 'Lon_or': 'Lon_end'})
##we merge
bikes_2012=bikes_2012.merge(bike_st, how="left")

As a side note, the reason why I chose to merge on the column that contains the terminal id rather than the terminal name is that Capital Bikeshare sometimes uses variations of the station name. This can create a lot of conflicts, so for now let's just stick with the terminal id as our merging column.

To get the distance between the bike stations I used a function that applies the haversine formula to our latitude and longitude coordinates. Once we have the distances we check for missing values.

##we define our distance function in miles
def distance(lon1, lat1, lon2, lat2):
    #transform to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    #haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    km = 6367 * c
    #transform to miles
    mile = km * 0.621371
    return mile

#we create a column with the distance between trips
bikes_2012['distance'] = bikes_2012.apply(lambda x: distance(x['Lon_or'],
                         x['Lat_or'], x['Lon_end'], x['Lat_end']), axis=1)
#we check for missing values in the columns of our dataframe
bikes_2012.isnull().sum()

There are some missing observations from a station that no longer exists, the White House station at 17th & State Pl NW. In total it had 600 trips originating from it and 630 trips ending there. Additionally, there were 31 trips that ended at the Capital Bikeshare warehouse. If we filter out these missing values we are left with 1,824,514 observations in our dataset.

#we create the list of rows with missing values to identify the stations with no information
missing = bikes_2012[bikes_2012['Lat_or'].isnull() | bikes_2012['Lat_end'].isnull()]
#we filter out the missing values and keep the complete observations
bikes_2012_com = bikes_2012.dropna(subset=['Lat_or', 'Lon_or', 'Lat_end', 'Lon_end'])
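
To see which stations account for the missing coordinates (the White House station and the warehouse mentioned above), a quick tally along the following lines works. This is just a sketch that reuses the missing data frame and the column names from the merge step above.

#count the trips with missing coordinates by origin and by destination station
print missing['start_station'].value_counts().head()
print missing['end_station'].value_counts().head()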

With this information we now see that the average distance of a bike ride in 2012 was 1.17 miles, and the median was 0.99 miles. Furthermore, the total distance traveled by Capital Bikeshare riders was an astonishing 2.1 million miles, roughly nine times the distance from the Earth to the Moon, or the equivalent of the entire network of natural gas distribution pipelines in the USA.
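
For reference, here is a minimal sketch of how these summary figures can be pulled from the filtered data frame, assuming the column names used in the code above.

#summary of trip distances in miles
print bikes_2012_com['distance'].mean()    #average distance per trip
print bikes_2012_com['distance'].median()  #median distance per trip
print bikes_2012_com['distance'].sum()     #total miles ridden in 2012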


Let’s figure out what the distribution of ridership distance is. To do that we will use a sample of our data and make use of the graphic capabilities of plotly.

#plot using a random sample
rows = random.sample(bikes_2012_com.index, 40000)
duration_sample = bikes_2012_com.ix[rows]
#histogram of the distribution of distance
fig = plt.hist(duration_sample['distance'].values, bins=20, alpha=0.5)
plt.title('Capital Bikeshare Distance of Trips 2012')
mpl_fig1 = plt.gcf()
py_fig1 = tls.mpl_to_plotly(mpl_fig1, verbose=True)
py.iplot_mpl(mpl_fig1, filename='duration_hist_secondpost')

We can see that a large share of the trips covered between half a mile and a mile and a half.

Capital Bikeshare Distance of Trips 2012

The most popular station, both for arrivals and departures as well as for longest total distance traveled, was Massachusetts Ave & Dupont Circle NW, with 53,786 trips and 56,270.1 miles spent on the road. Other popular stations were Columbus Circle / Union Station, 15th & P St NW, and Congress Heights Metro. The least utilized station in 2012 was Potomac Ave & 35th St S in Arlington, VA, with only 100 trips originating from there for a total distance covered of 95.5 miles.

#we see what station had the most trips
bikes_2012_com.groupby(['start_station', 'start_terminal']).count().sort('distance', ascending=False)

#we see what station traveled the largest distance
bikes_2012_com.groupby(['start_station', 'start_terminal']).sum().sort('distance', ascending=False)
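
The snippet above only looks at departures; a similar group-by on the destination columns gives the arrivals side. This is a sketch assuming the same column names as before.

#we see what station had the most trips ending there
bikes_2012_com.groupby(['end_station', 'end_terminal']).count().sort('distance', ascending=False)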

Now let's see how stations differ in the number of trips departing and arriving. We will use Mapbox to visualize this data (we will have a tutorial on Mapbox in the future). We first need to export our data to CSV format before we can use Mapbox.

#data for map 1, net inflow and outflow of trips per station
start_s = bikes_2012_com.groupby('start_terminal')['distance'].count().to_frame('outbound')
end_s = bikes_2012_com.groupby('end_terminal')['distance'].count().to_frame('inbound')
#new dataframe with both counts
flow_stations=pd.merge(start_s,end_s, left_index=True, right_index=True)
#we get the station id from the index and parse it to a new column
flow_stations['station'] = flow_stations.index
#merge with the station data (we reuse the lists parsed from the xml feed)
prov1 = zip(terminal_name, lat, lon, name)
bike_st2= pd.DataFrame(data = prov1, columns=['station', 'lat','lon','station_name'])
flow_stations=flow_stations.merge(bike_st2, how='left', on='station')
#we create the net outflow and inflow of trips
flow_stations['net_flow'] = flow_stations['outbound'] - flow_stations['inbound']
#data for map 2, median distance traveled per trip by start station
distance_flow = bikes_2012_com.groupby('start_terminal')['distance'].median().to_frame('median_distance')
distance_flow['station'] = distance_flow.index
distance_flow=distance_flow.merge(bike_st2, how='left', on='station')
##export data to csv for Mapbox (file names assumed)
flow_stations.to_csv('flow_stations.csv', sep=',', index=False)
distance_flow.to_csv('distance_flow.csv', sep=',', index=False)

In the following map we can see that all the stations in the northern part of DC were departure (outbound) points, and we can see this trend very clearly north of Florida Avenue. This may not be surprising given that the city gets hilly up there.

Map 1: Capital Bikeshare Stations, Inbound and Outbound Stations in 2012

In terms of median distance traveled per trip we see a similar pattern: stations located farther north in the DC area tend to have longer trips than the downtown stations. We can see that 16th Street Heights, University Heights and, to a lesser degree, Petworth all have longer trips than the average downtown station.

Map 2: Median Distance Traveled per Trip in 2012, Capital Bikeshare Stations

In the next post we will explore how the different stations connect to each other as well as a few other interesting aspects of the bikeshare system.


Making Choropleth Maps in R

A couple weeks back, I posted a county choropleth map I made from Census County Business Patterns (CBP) data using R and its base maps library. Here's a quick how-to. I used a few great resources, including this, this, and the official R maps documentation.

First, we download CBP data from here. Unzip it and save the .txt file to wherever you point your R directory. OK, let’s bring in the data and set up the libraries (maps, mapproj, ggplot2, and Cairo).

#set your working directory to wherever you've stored your unzipped CBP data with the cbp12co.txt file within
setwd("C:/Users/SKulkarni/Desktop")

#these are the libraries we'll be using -- maps, mapproj for map projections, ggplot2 for histograms, and Cairo for exporting clean-looking graphics
library(maps); library(mapproj); library(ggplot2); library(Cairo)

The base maps library has detailed data along with its maps, including FIPS codes for counties and states, which will be super useful for us.

#the maps library contains several data files pertaining to the maps within: world with countries, states, counties. Call the county.fips data set
data(county.fips)

#read in the data
cbp <- read.csv("cbp12co.txt")

Now that we’ve brought in the data and set up the libraries, we’ll identify the type of establishment we’re interested in looking at by NAICS code, a full list of which you can find here.

#here's the first subset we create, based on the 5-digit NAICS (44511/) for grocery store establishments. a full list of 2-6 digit NAICS can be found here, which can be used to parse and aggregate the data: [LINK]
groc <- cbp[ which(cbp$naics=='44511/'), ]
#we're using the C-style sprintf command to format the county FIPS as text. We want a combined FIPS code made of the state code followed by a zero-padded three-digit county code, which matches the fips column in the maps package's county.fips data.
groc$fipscty <- sprintf("%03d", groc$fipscty)
groc$fips <- paste0(groc$fipstate, groc$fipscty)
grocdata <- aggregate(cbind(groc$est, groc$emp), by=list(groc$fips), FUN=sum, na.rm=TRUE)
names(grocdata)[1:3] <- c('fips', 'grocest', 'grocemp')

#And here's the second subset, using the 6-digit NAICS (722513) for fast food establishments. 
fast <- cbp[ which(cbp$naics=='722513'), ]
fast$fipscty <- sprintf("%03d", fast$fipscty)
fast$fips <- paste0(fast$fipstate, fast$fipscty)
fastdata <- aggregate(cbind(fast$est, fast$emp), by=list(fast$fips), FUN=sum, na.rm=TRUE)
names(fastdata)[1:3] <- c('fips', 'fastest', 'fastemp')

fastgrocery <- merge(fastdata, grocdata, by="fips")
fastgrocery$ratio <- fastgrocery$fastest / fastgrocery$grocest
fastgrocery$dummy <- 1

Now that we've merged our data and set up our data frame the way we want it, we'll start to make a map.

#OK, enough processing; let's make a map
#First, we're going to use the Cairo package; Cairo allows you to create anti-aliasing files so they look super clean
#Call Cairo and set the file name before starting
Cairo(file="C:/Users/SKulkarni/Desktop/fastfood1.png", type="png")
#Now we set parameters, including colors
colors = c("#C5FA80", "#7FC372", "#488D5E", "#205944", "#072923")
#Here we chop up the data into intervals; here I'm using quintiles
fastgrocery$colorBuckets <- cut(fastgrocery$ratio, c(0, 1.875, 3, 3.73, 4.32, 100))
#This is how you match up the intervals and color schema to county FIPS codes
colorsmatched <- fastgrocery$colorBuckets[match(county.fips$fips, fastgrocery$fips)]
#We draw a map and use a Polyconic projection with a transparent background
map("county", col = colors[colorsmatched], resolution = 0,lty = 0, fill=TRUE, projection = "polyconic", bg = "transparent")
#Here we're putting on top another county map that's blank but has thin white lines
map("county", col="grey", fill=FALSE, add=TRUE, lty=1, lwd=.00000001, projection="polyconic", bg="transparent")
#Here we throw on a state map with white colored borders
map("state", col = "white", fill = FALSE, add = TRUE, lty = 1, lwd = 2, projection="polyconic", bg = "transparent")
title("Fast Food to Grocery Store Establishments, 2013")
legend("bottomleft", legend = levels(fastgrocery$colorBuckets), fill=colors, title="Ratio", bty="n")
#Don't forget to use this command to close out and save your plot once you like what you see
dev.off()

That's it! The base maps package is nimble and works well with county- and state-level data. The map data that ships with it can also be used with lat-long coordinates, which I'll try to get into next time.


Fast Food Counties

Slow food movement be damned; we’re a nation of fast foodies. In 2013, there were 224,759 fast food restaurants across the U.S. Americans love hitting the drive-thru so much that we have almost three and a half burger, taco, and chicken joints for every grocery store in the U.S. Parts of the country — particularly places in the West and the South — seem to really dig stopping off on the way home (or maybe on the way to work — AM Crunchwrap anyone?)

Using a combination of Census County Business Patterns (CBP) data and R, we can painlessly process the data and create a county choropleth map to quickly figure out whether there's a clear spatial relationship between fast food joints and grocery stores (or between any other NAICS-based industries, for that matter).

OK, let's explore. Census regions provide a good first look. The South (the largest Census region, which cuts a wide swath of states from Maryland to Texas) has over four fast food joints for each grocery store. By the way, I'm using the North American Industry Classification System (NAICS) to classify establishments from CBP as grocery (NAICS 44511) or fast food (NAICS 722513, "Limited-service restaurants"). The West and the Midwest aren't far behind, with 3.5 fast food places for each grocery store. The Northeast is significantly further behind, with fewer than two fast food stops for each grocery store.

Mapping this ratio by county makes things a little less clear:



A quick note: the CBP counties list and the "maps" package have differing counts of counties; I'll be discussing this next time with some suggestions for how to work around it. A couple of things do jump out, though. Ohio looks like it has a lot of counties in the top quintile, which makes sense since Columbus is America's test kitchen when it comes to new fast food products. Texas, New Mexico, Arizona, and Nevada look like they prefer a quick stop to cooking a full meal, and taking a look at the top states bears this out:

polyname      ratio      fastest   grocest
nevada        6.516517   2170      333
utah          5.980114   2105      352
new mexico    5.409091   1428      264
texas         5.30805    18265     3441
arizona       5.13697    4238      825
colorado      4.832494   3837      794
indiana       4.585462   4668      1018
alabama       4.551634   3482      765
oklahoma      4.332253   2673      617
idaho         4.297521   1040      242

There is quite a bit of variation between counties though, as a ggplot2 histogram will bear out:


Visually, it seems like density or commuting would play a big role in the ratio, but I wouldn’t be surprised if demographics like poverty can explain a lot, too. Testing against obesity rates may also reveal something about how the fast-food grocery ratio affects our waistlines. That’s for a future post.

Next time, I’ll cover how to use R to quickly process data from the CBP and then how I deployed the R “maps” library to make the choropleth map above. I used a lot of great tutorials from around the web — starting with a very helpful post from Revolution Analytics — but also others around the net. I’ll move the ball on how to map CBP data with the “maps” library built-in map data and hopefully save some enterprising readers some time.


Exploring Bike Sharing Data in DC, Part 1

Since we love biking at Plots of Dots (Nico has at least 10 bikes, although we stopped counting once he hit 4) we decided to launch the website with a little tutorial on how to obtain and manipulate data from Capital Bikeshare to produce some interesting descriptive stats on bike sharing usage in Washington, D.C. For no other reason than randomness, I decided to analyze data for 2012.

First, we need to download the data and load the CSV files into Python. For this task we will use the pandas package. The first step is getting the data from the Capital Bikeshare data download page. As you can see, the data are divided into quarterly files.

I first wrote a script to fetch the data directly from the webpage and append all the information for one year into a single pandas data frame. However, after running the script I realized that Capital Bikeshare changes the names and the order of the columns in each file, making this procedure more complicated. So after a few minutes of frustration I realized it was easier to download the information for a single year into a dictionary that contains the four quarterly data frames. This is the code used to get the data, which, by the way, takes a while.

#import the necessary libraries
import pandas as pd
import random
import plotly.plotly as py
import plotly.tools as tls
from plotly.graph_objs import *
from pylab import *
import matplotlib.pyplot as plt
py.sign_in('your username', 'your key')

#We define the names of the different files in a list so we can use a dict comprehension
#to append them into a dictionary
periods = ['2012-1st-quarter','2012-2nd-quarter', '2012-3rd-quarter', '2012-4th-quarter']
url = 'https://www.capitalbikeshare.com/assets/files/trip-history-data/{}.csv'

#we append the different data frames to the dictionary
d = {period: pd.read_csv(url.format(period)) for period in periods}
#we verify the length and type of our structure
print len(d), type(d)

Before merging our data I wanted to better understand its structure, and after exploring it I realized that the first and second quarter data frames have 10 columns while the third and fourth quarters have 9 columns.
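
A quick way to verify this yourself is to print the number of columns in each quarterly data frame; here is a small sketch using the dictionary we just built.

#print the number of columns of each quarterly data frame
for period in periods:
    print period, len(d[period].columns)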

#we verify the contents of the first dataframe
d['2012-1st-quarter'].head()
#we rename the columns
d['2012-1st-quarter'] = d['2012-1st-quarter'].rename(columns={'Duration': 'duration',
                        'Duration(Sec)': 'duration_sec',
                        'Start date': 'start_date', 'Start Station':'start_station',
                        'Start Terminal':'start_terminal','End date':'end_date',
                        'End Station':'end_station', 'End Terminal':'end_terminal',
                        'Bike#':'bike_number', 'Type':'u_type'})

#we verify the new names of the columns
d['2012-1st-quarter'].columns

#we rename our columns from the second data frame
d['2012-2nd-quarter'] = d['2012-2nd-quarter'].rename(columns={'Duration': 'duration',
                        'Duration (sec)': 'duration_sec',
                        'Start date': 'start_date', 'Start Station':'start_station',
                        'Start terminal':'start_terminal','End date':'end_date',
                        'End Station':'end_station', 'End terminal':'end_terminal',
                        'Bike#':'bike_number', 'Bike Key':'u_type'})
#we repeat the process with the third dataframe

d['2012-3rd-quarter'] = d['2012-3rd-quarter'].rename(columns={
                        'Duration': 'duration',
                        'Start date': 'start_date',
                        'Start Station':'start_station',
                        'Start terminal':'start_terminal','End date':'end_date',
                        'End Station':'end_station', 'End terminal':'end_terminal',
                        'Bike#':'bike_number', 'Subscriber Type':'u_type'})
#transform our two date variables to datetime
d['2012-3rd-quarter']['start_date'] = pd.to_datetime(d['2012-3rd-quarter']['start_date'])
d['2012-3rd-quarter']['end_date'] = pd.to_datetime(d['2012-3rd-quarter']['end_date'])
#we calculate the time elapsed between the start and end of the trip in seconds
d['2012-3rd-quarter']['duration_sec'] = (d['2012-3rd-quarter']['end_date'] -
                        d['2012-3rd-quarter']['start_date']).astype('timedelta64[s]')

#we rename the variables for the 4th quarter
d['2012-4th-quarter'] = d['2012-4th-quarter'].rename(columns={
                        'Duration': 'duration',
                        'Start date': 'start_date', 'Start Station':'start_station',
                        'Start terminal':'start_terminal','End date':'end_date',
                        'End Station':'end_station', 'End terminal':'end_terminal',
                        'Bike#':'bike_number', 'Subscription Type':'u_type'})

#we transform our date variables to datetime
d['2012-4th-quarter']['start_date'] = pd.to_datetime(d['2012-4th-quarter']['start_date'])
d['2012-4th-quarter']['end_date'] = pd.to_datetime(d['2012-4th-quarter']['end_date'])
#we calculate the time elapsed between the start and end of the trip in seconds
d['2012-4th-quarter']['duration_sec'] = (d['2012-4th-quarter']['end_date'] -
                        d['2012-4th-quarter']['start_date']).astype('timedelta64[s]')

The columns that I used for the analysis contain information such as the name of the origin bike station, the id of the station, the start date of the trip, the name of the destination terminal, the id of the destination terminal, and so on. We rename the columns in the first two quarterly data frames so we can append them all later. The final list of columns to use is defined in the code below.

#we now define the columns and their order before merging them all together
#(the last few column names here are assumed; adjust the list to the columns you need)
columns=['start_terminal','start_date','start_station', 'end_terminal','end_date', 'end_station',
         'duration_sec', 'bike_number', 'u_type']

#we update the dataframes to keep only the columns we are interested in
d.update((x,y[columns]) for x, y in d.items())
#we concatenate all the quarterly data into a single data frame
bikes_2012=pd.concat(d, ignore_index=True)

As mentioned before, the third and fourth quarterly data frames have only 9 columns; the missing variable is the duration of the trip in seconds. Fortunately we have information on the start and end of the trip down to the second, so we can calculate the duration of these trips with the to_datetime functionality. I first converted our start and end dates with to_datetime and then subtracted the start date from the end date. Once the new column was created I parsed the results to seconds.

#we transform our trip duration from seconds to minutes
bikes_2012['duration_min'] = bikes_2012['duration_sec'] / 60.0
#we get a description of the dataframe to see how many observations we have
bikes_2012.describe()

Once we have all the columns we need for the analysis, we define a list with those columns and update our data frame dictionary so that we only keep them. Once this process is complete, we concatenate the four quarterly data frames into a single data frame and check how many trips we have.

#we create a random sample of 20,000 observations to graph using plotly
rows= random.sample(bikes_2012.index, 20000)
duration_sample = bikes_2012.ix[rows]
#we plot a boxplot of trip duration
duration_sample.boxplot(column='duration_min')
title('Capital Bikeshare Duration of Trips 2012')
mpl_fig1 = plt.gcf()
py_fig1 = tls.mpl_to_plotly(mpl_fig1, verbose=True)
py.iplot_mpl(mpl_fig1, filename='boxplot_test_1')

In total there were 2,049,576 trips. However not all of these correspond to real trips; if we plot our data we will realize that there are some very big outliers.

We create a boxplot using matplotlib and the nice wrapper that plotly provides, so we get a good-looking, somewhat interactive graph (we will have a post on using plotly soon). Given plotly's limitations (it cannot graph more than 50k observations) we create a random sample and use it to graph our data:

#we filter the data to get rid of observations above the 95th percentile
bikes_2012=bikes_2012[bikes_2012.duration_min< bikes_2012.duration_min.quantile(0.95)]
#we now plot again to see how the distribution changed
#remember to plot using just a sample from our dataset
rows= random.sample(bikes_2012.index, 20000)
duration_sample = bikes_2012.ix[rows]
duration_sample.boxplot(column='duration_min')
title('Capital Bikeshare Duration of Trips 2012')
mpl_fig1 = plt.gcf()
py_fig1 = tls.mpl_to_plotly(mpl_fig1, verbose=True)
py.iplot_mpl(mpl_fig1, filename='boxplot_test_2')
#we also plot a boxplot with all of our (filtered) data
bikes_2012.boxplot(column='duration_min')
title('Capital Bikeshare Duration of Trips 2012 with all observations')

The sample shows that there are some very large outliers in our boxplot. To get a more accurate picture of the average trip duration we filter out values higher than the 95th percentile, which should get rid of the one trip that lasted 83 days as well as others where people probably just forgot to return their bikes. This is the new boxplot:


The distribution looks better, but we still have some outliers with negative values. If we compare this boxplot with the one obtained using a random sample, we see that the sample's distribution is less skewed and has no negative outliers. However, to better understand the data and get a more accurate duration we will also filter out trips that lasted less than 3 minutes (probably not a real trip but rather a failure at the terminal) and plot the distribution.

#we filter out values below 0 and trips shorter than 3 minutes
bikes_2012=bikes_2012[bikes_2012.duration_min> 3 ]
#we plot a histogram of the trip duration
bikes_2012['duration_min'].hist(bins=20)
title("Capital Bikeshare Distribution of duration" +"\n" + "of Trips 2012")
#we get descriptive stats
bikes_2012['duration_min'].describe()

We now see that our trip duration data seems to follow an exponential distribution (more on this topic later). The average trip duration in 2012 was 12.9 minutes, while the median, which is probably closer to the typical trip length, was 10.9 minutes. Finally, let's see if this distribution changes by day of the week:

#lets create the day of the week when the trip happened
bikes_2012['start_date'] = pd.to_datetime(bikes_2012['start_date'])
bikes_2012['day']= bikes_2012['start_date'].dt.weekday
#plot by day, creating a random sample
rows= random.sample(bikes_2012.index, 20000)
duration_sample = bikes_2012.ix[rows]
duration_sample.boxplot(by='day', column='duration_min')
title("Capital Bikeshare Distribution of duration" +"\n" + "of Trips by day 2012")
mpl_fig1 = plt.gcf()
py_fig1 = tls.mpl_to_plotly(mpl_fig1, verbose=True)
py.iplot_mpl(mpl_fig1, filename='boxplot_test_3')
#in case you want to save the cleaned data for the next post (file name assumed)
bikes_2012.to_csv('bikes_2012.csv', sep=',', index=False)

Capital Bikeshare Distribution of duration of Trips by day 2012

We can see that ridership changes a little bit on days 5 and 6, which correspond to Saturday and Sunday.
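
If you want to put numbers on what the boxplot shows, a quick group-by on the day column does the trick (a sketch; pandas numbers weekdays starting at 0 for Monday).

#median trip duration in minutes by day of the week (0 = Monday, 6 = Sunday)
print bikes_2012.groupby('day')['duration_min'].median()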

In the next post we will continue cleaning up the data and start looking into more advanced analysis of the Bikeshare system.