NYC Taxi Trip Duration EDA notebook
Process :
- 0 LOAD DATA 
- 1 SIMPLE FEATURE EXTRACT
- 2 DATA OVERVIEW
    - 2.1 vendor_id
    - 2.2 pickup_day & dropoff_day
    - 2.3 passenger_count
    - 2.4 pickup & dropoff locations (lon & lat)
    - 2.5 store_and_fwd_flag
    - 2.6 trip_duration
- 3 FEATURE ENGINEERING
- 4 FEATURE ANALYSIS
    - 4.1 Duration VS. Distance
    - 4.2 Driving Direction
    - 4.3 Clustering
    - 4.4 Avg speed, orders on clustering
    - 4.5 Cyclic timestamp (week, hour, weekhour)
- 5 DATA CLEANING ANALYSIS 
- File descriptions
 
- train.csv - the training set (contains 1458644 trip records)
 - test.csv - the testing set (contains 625134 trip records)
 - sample_submission.csv - a sample submission file in the correct format
 
- Data fields
 
- id - a unique identifier for each trip
 - vendor_id - a code indicating the provider associated with the trip record
 - pickup_datetime - date and time when the meter was engaged
 - dropoff_datetime - date and time when the meter was disengaged
 - passenger_count - the number of passengers in the vehicle (driver entered value)
 - pickup_longitude - the longitude where the meter was engaged
 - pickup_latitude - the latitude where the meter was engaged
 - dropoff_longitude - the longitude where the meter was disengaged
 - dropoff_latitude - the latitude where the meter was disengaged
 - store_and_fwd_flag - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip
 - trip_duration - duration of the trip in seconds
 
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# load data
df_train = pd.read_csv('~/NYC_Taxi_Trip_Duration/data/train.csv')
df_test = pd.read_csv('~/NYC_Taxi_Trip_Duration/data/test.csv')
sampleSubmission = pd.read_csv('~/NYC_Taxi_Trip_Duration/data/sample_submission.csv')
df_train.head(2)
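As an optional variant (a sketch, not what this notebook does), the datetime columns could be parsed directly at load time, which would avoid the string splitting in the helper below; df_train_parsed is a throwaway name used only for illustration.
# optional sketch: parse the datetime columns while loading instead of splitting strings later
df_train_parsed = pd.read_csv('~/NYC_Taxi_Trip_Duration/data/train.csv',
                              parse_dates=['pickup_datetime', 'dropoff_datetime'])
df_train_parsed.dtypes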
# helper function: extract basic date/time features from the raw datetime strings
def basic_feature_extract(df):
    df_ = df.copy()
    # pickup
    df_["pickup_date"] = pd.to_datetime(df_.pickup_datetime.apply(lambda x: x.split(" ")[0]))
    df_["pickup_hour"] = df_.pickup_datetime.apply(lambda x: x.split(" ")[1].split(":")[0])
    df_["pickup_year"] = df_.pickup_datetime.apply(lambda x: x.split(" ")[0].split("-")[0])
    df_["pickup_month"] = df_.pickup_datetime.apply(lambda x: x.split(" ")[0].split("-")[1])
    df_["pickup_weekday"] = df_.pickup_datetime.apply(lambda x: pd.to_datetime(x.split(" ")[0]).weekday())
    # dropoff
    # the test set has no dropoff_datetime column, so only build these features when it exists
    if "dropoff_datetime" in df_.columns:
        df_["dropoff_date"] = pd.to_datetime(df_.dropoff_datetime.apply(lambda x: x.split(" ")[0]))
        df_["dropoff_hour"] = df_.dropoff_datetime.apply(lambda x: x.split(" ")[1].split(":")[0])
        df_["dropoff_year"] = df_.dropoff_datetime.apply(lambda x: x.split(" ")[0].split("-")[0])
        df_["dropoff_month"] = df_.dropoff_datetime.apply(lambda x: x.split(" ")[0].split("-")[1])
        df_["dropoff_weekday"] = df_.dropoff_datetime.apply(lambda x: pd.to_datetime(x.split(" ")[0]).weekday())
    return df_
# get weekday name
import calendar
def get_weekday(df):
    df_ = df.copy()
    df_['pickup_week_'] = pd.to_datetime(df_.pickup_datetime, errors='coerce').dt.weekday
    df_['pickup_weekday_'] = df_['pickup_week_'].apply(lambda x: calendar.day_name[x])
    return df_
# get trip duration from the timestamps (a Timedelta, useful for sanity-checking the given trip_duration column)
def get_duration(df):
    df_= df.copy()
    df_['trip_duration_cal'] = pd.to_datetime(df_['dropoff_datetime']) - pd.to_datetime(df_['pickup_datetime'])
    return df_
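A quick sanity check (an added sketch, not part of the original flow): the recomputed timedelta should agree with the provided trip_duration column for clean records.
# sanity check (sketch): compare the recomputed duration with the provided trip_duration column
check = get_duration(df_train)
diff_sec = check['trip_duration_cal'].dt.total_seconds() - check['trip_duration']
print((diff_sec.abs() > 1).sum(), 'records where the two durations disagree by more than 1 second')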
# this may take a few minutes
df_train_ = basic_feature_extract(df_train)
df_train_ = get_duration(df_train_)
df_train_ = get_weekday(df_train_)
df_train.info()
- The training set has 1,458,644 trip records (one ride per record)
  
- There are no missing values
df_train_.head(1)
df_train.vendor_id.value_counts()
vendor_id identifies the provider of each trip record; there are 2 vendors in this NYC taxi dataset, and they have similar ride counts.
fig, ax = plt.subplots(ncols=2, sharey=True)
fig.set_size_inches(12, 5)
ax[0].plot(df_train_.groupby('pickup_date').count()['id'], 'go-', alpha=0.5)
ax[1].plot(df_train_.groupby('dropoff_date').count()['id'], 'bo-', alpha=0.5)
ax[0].set(xlabel='date', ylabel='Count',title="pickup counts")
ax[1].set(xlabel='date', ylabel='Count',title="dropoff counts")
plt.show()
The order counts drop suddenly on 2016-01-23 (likely the January 2016 blizzard); this needs a deeper check to see whether it affects data quality.
On average there are about 7,000 - 9,000 orders (pickups) per day.
df_train_.groupby(['pickup_date']).sum()['passenger_count'].plot()
df_train_.groupby(['dropoff_date']).sum()['passenger_count'].plot()
df_train_.passenger_count.value_counts(sort=False)
- On average 12k - 16k passengers per day
 - Most rides carry 1 passenger, but counts like 0, 7, 8, 9 are probably outliers (a filtering sketch follows below)
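A minimal filtering sketch for the suspicious passenger counts noted above; the 1-6 range is an assumption, not a rule stated in the data description.
# keep only rides with a plausible passenger count (assumed range 1-6)
mask = df_train_.passenger_count.between(1, 6)
print('dropped', (~mask).sum(), 'rides with passenger_count outside 1-6')
df_passengers_ok = df_train_[mask]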
 
# https://www.kaggle.com/misfyre/in-depth-nyc-taxi-eda-also-w-animation
# following the reference above, drop location outliers and visualize the geo data
# (keep only points between the 0.5% and 99.5% quantiles)
# pickup 
sns.lmplot(x="pickup_longitude", y="pickup_latitude", fit_reg=False, 
           size=9, scatter_kws={'alpha':0.3,'s':5}, data=df_train_[(
                 df_train_.pickup_longitude>df_train_.pickup_longitude.quantile(0.005))
               &(df_train_.pickup_longitude<df_train_.pickup_longitude.quantile(0.995))
               &(df_train_.pickup_latitude>df_train_.pickup_latitude.quantile(0.005))                           
               &(df_train_.pickup_latitude<df_train_.pickup_latitude.quantile(0.995))])
plt.xlabel('Pickup Longitude');
plt.ylabel('Pickup Latitude');
plt.show()
# dropoff 
sns.lmplot(x="dropoff_longitude", y="dropoff_latitude",fit_reg=False, 
           size=9, scatter_kws={'alpha':0.3,'s':5}, data=df_train_[(
                 df_train_.dropoff_longitude>df_train_.dropoff_longitude.quantile(0.005))
               &(df_train_.dropoff_longitude<df_train_.dropoff_longitude.quantile(0.995))
               &(df_train_.dropoff_latitude>df_train_.dropoff_latitude.quantile(0.005))                           
               &(df_train_.dropoff_latitude<df_train_.dropoff_latitude.quantile(0.995))])
plt.xlabel('Dropoff Longitude');
plt.ylabel('Dropoff Latitude');
plt.show()
We get a very interesting outcome from this visualization:
- Manhattan gets MOST of the pickup orders; since many people work there, it makes sense that this is where the main demand comes from.
- Many dropoff orders fall outside Manhattan, maybe because people work/study in Manhattan but live in areas like Queens/Brooklyn.
- JFK Airport (the isolated cluster to the east) may be a key point affecting duration: many orders run back and forth between the city and JFK, which is not a short distance, so it may affect duration prediction a lot.
df_train_.store_and_fwd_flag.value_counts()
About 0.5% (8045/1450599) of orders (store_and_fwd_flag = Y) were not sent immediately to the vendor server but held in the taxi's memory; we need to investigate whether this affects data quality as well.
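To see whether the held-back records behave differently, a simple comparison of trip-duration statistics by flag (an added sketch, not in the original notebook):
# compare trip duration statistics for stored-and-forwarded vs directly transmitted records (sketch)
print(df_train_.groupby('store_and_fwd_flag')['trip_duration'].describe())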
df_train_.head(1)
# trip duration overall distribution 
# remove outlier by only taking data under .97 quantile
tripduration = df_train_[df_train_.trip_duration < df_train_.trip_duration.quantile(.97)]
tripduration.groupby('trip_duration').count()['id'].plot()
plt.xlabel('trip duration (sec)')
plt.ylabel('trips counts')
plt.title('Duration Distribution')
# pivot table
# http://pbpython.com/pandas-pivot-table-explained.html
#tripduration = df_train_[df_train_.trip_duration < df_train_.trip_duration.quantile(.97)]
pd.pivot_table(tripduration, index='pickup_hour' ,aggfunc=np.mean)['trip_duration'].plot(label='mean')
pd.pivot_table(tripduration, index='pickup_hour' ,aggfunc=np.median)['trip_duration'].plot(label='median')
pd.pivot_table(tripduration, index='pickup_hour' ,aggfunc=np.std)['trip_duration'].plot(label='std')
plt.legend(loc=0)
plt.xlabel('Pick up Hour (0-23)')
plt.ylabel('trip_duration (sec)')
plt.title('Trip Duration VS. Pickup Hour')
plt.show()
# plot trip duration by store_and_fwd_flag
# (records sent to the vendor server directly vs. stored in the taxi and uploaded later due to connectivity issues)
plt.figure(figsize=(14,6))
sns.barplot(x='pickup_hour',y='trip_duration',data=df_train_,hue='store_and_fwd_flag')
plt.xlabel('pickup_hour',fontsize=16)
plt.ylabel('mean(trip_duration)',fontsize=16)
# duration VS. pickup hour in given months 
plt.figure(figsize=(14,6))
sns.pointplot(x='pickup_hour',y='trip_duration',data=tripduration,hue='pickup_month')
plt.xlabel('pickup_hour',fontsize=16)
plt.ylabel('mean(trip_duration)',fontsize=16)
# duration VS. pickup hour in weekdays 
plt.figure(figsize=(14,6))
sns.pointplot(x='pickup_hour',y='trip_duration',data=tripduration,hue='pickup_weekday_',hue_order=list(calendar.day_name))
plt.xlabel('pickup_hour',fontsize=16)
plt.ylabel('mean(trip_duration)',fontsize=16)
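Section 4.5 of the outline mentions cyclic timestamps; one common way to encode hour and weekday cyclically is a sin/cos transform. This is an added sketch (the new column names are my own choice), and pickup_hour is cast to int because it was extracted as a string above.
# cyclic (sin/cos) encoding of pickup hour and weekday (sketch for outline item 4.5)
hour = df_train_['pickup_hour'].astype(int)
df_train_['pickup_hour_sin'] = np.sin(2 * np.pi * hour / 24)
df_train_['pickup_hour_cos'] = np.cos(2 * np.pi * hour / 24)
df_train_['pickup_weekday_sin'] = np.sin(2 * np.pi * df_train_['pickup_weekday'] / 7)
df_train_['pickup_weekday_cos'] = np.cos(2 * np.pi * df_train_['pickup_weekday'] / 7)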
df_train_.groupby('trip_duration').count()['id'].tail(5)
# 86392 sec is almost 24 hours
# 2049578 sec is about 23.7 days
# neither makes sense for a single taxi ride
Duration is the prediction target in this competition:
- Most trips finish within 6-10 minutes (400-600 sec). Trips that are far too long or far too short are obviously wrong, maybe because of technical or manual-operation issues; they should be filtered out in the following steps (see the sketch below).
- Duration is longer when store_and_fwd_flag = Y, maybe because connectivity is bad in areas off downtown, implying those trips are relatively much longer.
- Duration RISES dramatically from 7 AM to 10 AM, maybe owing to traffic jams as people start moving into NYC from nearby areas during the daytime.
- Month only affects duration a little, while pickup hour seems to influence it more.
- Weekday is apparently related to duration: durations are higher during workdays (Mon.-Fri.), when taxis are much busier taking people between Manhattan and areas outside Manhattan.
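A hedged sketch of the filtering suggested above; the specific bounds (trips shorter than 1 minute or longer than 6 hours) are assumptions, not values fixed by the notebook.
# drop implausibly short or long trips (assumed bounds: 1 minute to 6 hours)
dur_ok = df_train_[(df_train_.trip_duration >= 60) & (df_train_.trip_duration <= 6 * 3600)]
print('kept', len(dur_ok), 'of', len(df_train_), 'trips')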
# feature extract
###  distance 
# https://www.kaggle.com/gaborfodor/from-eda-to-the-top-lb-0-377?scriptVersionId=1369021
# Haversine distance
def get_haversine_distance(lat1, lng1, lat2, lng2):
    # km
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    AVG_EARTH_RADIUS = 6371  #  km
    lat = lat2 - lat1
    lng = lng2 - lng1
    d = np.sin(lat * 0.5) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(lng * 0.5) ** 2
    h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
    return h 
# Manhattan distance
# Taxis can't fly! They have to follow the street grid (blocks/roads)
def get_manhattan_distance(lat1, lng1, lat2, lng2):
    # km 
    a = get_haversine_distance(lat1, lng1, lat1, lng2)
    b = get_haversine_distance(lat1, lng1, lat2, lng1)
    return a + b
# get direction (arc tangent angle)
def get_direction(lat1, lng1, lat2, lng2):
    # theta
    AVG_EARTH_RADIUS = 6371  #  km
    lng_delta_rad = np.radians(lng2 - lng1)
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    y = np.sin(lng_delta_rad) * np.cos(lat2)
    x = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(lng_delta_rad)
    return np.degrees(np.arctan2(y, x))
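A quick sanity check of these helpers (an added sketch; the coordinates are approximate values for JFK and midtown Manhattan): the straight-line distance should come out on the order of 20 km, with the Manhattan distance a bit longer.
# sanity check (sketch): JFK (approx. 40.641, -73.778) to midtown Manhattan (approx. 40.771, -73.977)
print(get_haversine_distance(40.641, -73.778, 40.771, -73.977))   # roughly 20 km expected
print(get_manhattan_distance(40.641, -73.778, 40.771, -73.977))   # L1 distance, somewhat longer
print(get_direction(40.641, -73.778, 40.771, -73.977))            # bearing in degrees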
### ======================== ###
def get_features(df):
    # km 
    df_ = df.copy()
    # use .loc assignment to add the derived columns
    # distance
    df_.loc[:, 'distance_haversine'] = get_haversine_distance(
                      df_['pickup_latitude'].values,
                      df_['pickup_longitude'].values,
                      df_['dropoff_latitude'].values,
                      df_['dropoff_longitude'].values)
    df_.loc[:, 'distance_manhattan'] = get_manhattan_distance(
                      df_['pickup_latitude'].values,
                      df_['pickup_longitude'].values,
                      df_['dropoff_latitude'].values,
                      df_['dropoff_longitude'].values)
    # direction 
    df_.loc[:, 'direction'] = get_direction(df_['pickup_latitude'].values,
                                          df_['pickup_longitude'].values, 
                                          df_['dropoff_latitude'].values, 
                                          df_['dropoff_longitude'].values)
    # get average driving speed in km/hr
    # distance [km] / duration [sec] * 3600 [sec/hr] = km/hr
    df_.loc[:, 'avg_speed_h'] = 3600 * df_['distance_haversine'] / df_['trip_duration']
    df_.loc[:, 'avg_speed_m'] = 3600 * df_['distance_manhattan'] / df_['trip_duration']
    
    return df_
    
# get speed (taxi velocity)
#df_train_.loc[:, 'avg_speed_h'] = 1000 * df_train_['distance_haversine'] / df_train_['trip_duration']
#df_train_.loc[:, 'avg_speed_m'] = 1000 * df_train_['distance_manhattan'] / df_train_['trip_duration']
df_train_ = get_features(df_train_)
df_train_.head(1)
# distance VS duration 
sns.jointplot((df_train_["distance_manhattan"][:10000]+1),(df_train_["trip_duration"][:10000]+1),s=10,alpha=0.5,color='blue')
plt.xlabel('distance_manhattan (km)')
plt.ylabel('trip_duration')
plt.xlim(0,30)
plt.ylim(0,5000)
plt.title('Distance VS Duration')
plt.show()
# log distance VS duration 
sns.jointplot(np.log(df_train_["distance_manhattan"][:10000]+1),(df_train_["trip_duration"][:10000]+1),s=10,alpha=0.5,color='purple')
plt.xlim(0,30)
plt.ylim(0,5000)
plt.title('Logarithm Distance VS Duration')
plt.xlabel('log(distance_manhattan)')
plt.ylabel('trip_duration')
plt.show()
There is a positive relation between distance and duration in both cases:
- (log) haversine distance VS. duration
- (log) manhattan distance VS. duration
which fits common sense: trips take longer when the distance is longer.
** log(distance) VS. duration seems to have a better correlation, so log(distance) will be added as a new feature (a sketch follows below).
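A minimal sketch of the log-distance feature suggested above (the column names are my own choice); log1p is used so zero-length trips do not produce log(0).
# add log-transformed distance features (sketch)
df_train_['log_distance_haversine'] = np.log1p(df_train_['distance_haversine'])
df_train_['log_distance_manhattan'] = np.log1p(df_train_['distance_manhattan'])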
# ref 
# https://www.kaggle.com/hanriver0618/nyc-taxi-data-exploration-visualization
# haversine distance VS duration
# or you can plot all data points; the relation is still roughly linear
sns.jointplot((df_train_["distance_haversine"][:10000]+1),(df_train_["trip_duration"][:10000]+1),s=10,alpha=0.5,color='green')
plt.xlabel('distance_haversine (km)')
plt.ylabel('trip_duration (sec)')
plt.show()
# manhattan distance VS duration
# or you can plot all data points; the relation is still roughly linear
sns.jointplot((df_train_["distance_manhattan"][:10000]+1),(df_train_["trip_duration"][:10000]+1),s=10,alpha=0.5,color='black')
plt.xlabel('distance_manhattan (km)')
plt.ylabel('trip_duration (sec)')
plt.show()
# remove potential outliers for better visualization
df_avg_speed = df_train_[(df_train_['avg_speed_h'] < df_train_['avg_speed_h'].quantile(0.99))&
         (df_train_['avg_speed_h'] > df_train_['avg_speed_h'].quantile(0.01))]
sns.distplot(df_avg_speed.avg_speed_h)
plt.xlabel('average haversine speed [km/hr]',fontsize=16)
plt.ylabel('relative amount',fontsize=16)
plt.title('Speed Distribution',fontsize=16)
plt.show()
# Speed in different situations 
plt.figure(figsize=(20,12))
pd.pivot_table(df_train_, index='pickup_hour',columns='pickup_weekday' ,aggfunc=np.mean)['avg_speed_h'].plot()
plt.xlabel('Pick up hour',fontsize=16)
plt.ylabel('Speed [km/hr] ',fontsize=16)
plt.title('Driving Speed VS. Pickup Hour VS. Pickup Weekday',fontsize=16)
# haversine speed visualization 
fig, ax = plt.subplots(nrows=3, sharey=True)
fig.set_size_inches(12, 10)
ax[0].plot(df_train_.groupby('pickup_hour').mean()['avg_speed_h'], 'bo-', lw=2, alpha=0.7)
ax[1].plot(df_train_.groupby('pickup_weekday').mean()['avg_speed_h'], 'go-', lw=2, alpha=0.7)
ax[2].plot(df_train_.groupby('pickup_date').mean()['avg_speed_h'], 'ro-', lw=2, alpha=0.7)
ax[0].set_xlabel('hour',fontsize=16)
ax[1].set_xlabel('weekday',fontsize=16)
ax[2].set_xlabel('date',fontsize=16)
ax[0].set_ylabel('average speed',fontsize=16)
ax[1].set_ylabel('average speed',fontsize=16)
ax[2].set_ylabel('average speed',fontsize=16)
fig.suptitle('Average Taxi Speed [km/hr] VS Pickup Times',fontsize=16)
plt.show()
# pickup 
sns.lmplot(x="pickup_longitude", y="pickup_latitude", fit_reg=False, 
           size=9, scatter_kws={'alpha':0.05,'s':5}, data=df_train_[(
                 df_train_.pickup_longitude>df_train_.pickup_longitude.quantile(0.005))
               &(df_train_.pickup_longitude<df_train_.pickup_longitude.quantile(0.995))
               &(df_train_.pickup_latitude>df_train_.pickup_latitude.quantile(0.005))                           
               &(df_train_.pickup_latitude<df_train_.pickup_latitude.quantile(0.995))])
JFK_location=[-73.778203,40.641165]
LaGuardia_location = [-73.873923,40.776935]
NYC_center = [-73.977282,40.770940]
### transform locations to dataframe 
locations=pd.DataFrame({'JFK':JFK_location,
             'LaGuardia':LaGuardia_location,
             'NYC':NYC_center}).T
locations.columns=['lon','lat']
###
plt.plot(JFK_location[0],JFK_location[1],'o', color = 'r',alpha=0.9,markersize=10)
plt.plot(LaGuardia_location[0],LaGuardia_location[1],'o', color = 'r',alpha=0.9,markersize=10)
plt.plot(NYC_center[0],NYC_center[1],'o', color = 'r',alpha=0.9,markersize=10)
plt.annotate('NYC_center', (NYC_center[0], NYC_center[1]), color = 'black', fontsize = 15)
plt.annotate('LaGuardia', (LaGuardia_location[0], LaGuardia_location[1]), color = 'black', fontsize = 15)
plt.annotate('JFK', (JFK_location[0], JFK_location[1]), color = 'black', fontsize = 15)
plt.xlabel('Pickup Longitude',fontsize=16);
plt.ylabel('Pickup Latitude',fontsize=16);
plt.title('JFK <--> NYC, LAG <---> NYC',fontsize=16)
plt.show()
# get angles between the main airports and NYC downtown
# note: get_direction expects (lat1, lng1, lat2, lng2), while the location lists are [lon, lat]
JFK_NYC = get_direction(JFK_location[1],
                        JFK_location[0],
                        NYC_center[1],
                        NYC_center[0])
LAG_NYC = get_direction(LaGuardia_location[1],
                        LaGuardia_location[0],
                        NYC_center[1],
                        NYC_center[0])
print ('JFK - NYC_center angle = ',JFK_NYC )
print ('LAG - NYC_center angle = ',LAG_NYC )
sns.distplot(df_train_.direction,color='black')
plt.xlabel('driving direction',fontsize=16)
plt.ylabel('relative amount',fontsize=16)
plt.title('Direction Distribution',fontsize=16)
plt.show()
from sklearn.cluster import MiniBatchKMeans
# cluster lon & lat for the per-location average speed calculation that follows
# note: the model is fitted on coordinates from both train and test (uses the globals df_train / df_test)
def get_clustering(df):
    coords = np.vstack((df_train[['pickup_latitude', 'pickup_longitude']].values,
                    df_train[['dropoff_latitude', 'dropoff_longitude']].values,
                    df_test[['pickup_latitude', 'pickup_longitude']].values,
                    df_test[['dropoff_latitude', 'dropoff_longitude']].values))
    df_ = df.copy()
    sample_ind = np.random.permutation(len(coords))[:500000]
    kmeans = MiniBatchKMeans(n_clusters=40, batch_size=10000).fit(coords[sample_ind])
    df_.loc[:, 'pickup_cluster'] = kmeans.predict(df_[['pickup_latitude', 'pickup_longitude']])
    df_.loc[:, 'dropoff_cluster'] = kmeans.predict(df_[['dropoff_latitude', 'dropoff_longitude']])
    # also return the fitted model so it can be applied to other dataframes (e.g. the test set)
    return df_, kmeans
df_train_,kmeans = get_clustering(df_train_)
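Since get_clustering also returns the fitted model, the same clusters can be assigned to the test set; this is an added sketch, and df_test_ is a hypothetical name for the test frame with cluster columns.
# assign the same clusters to the test set using the fitted model (sketch)
df_test_ = df_test.copy()
df_test_.loc[:, 'pickup_cluster'] = kmeans.predict(df_test_[['pickup_latitude', 'pickup_longitude']])
df_test_.loc[:, 'dropoff_cluster'] = kmeans.predict(df_test_[['dropoff_latitude', 'dropoff_longitude']])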
# clustering pickup lon & lat
# https://www.kaggle.com/drgilermo/dynamics-of-new-york-city-animation
# https://matplotlib.org/users/annotations.html
fig,ax = plt.subplots(figsize = (10,8))
for label in df_train_.pickup_cluster.unique():
    ax.plot(df_train_.pickup_longitude[df_train_.pickup_cluster == label],df_train_.pickup_latitude[df_train_.pickup_cluster == label],'.', alpha = 0.9, markersize = 0.5)
    ax.plot(kmeans.cluster_centers_[label,1],kmeans.cluster_centers_[label,0],'o', color = 'r')
    ax.annotate(label, (kmeans.cluster_centers_[label,1],kmeans.cluster_centers_[label,0]), color = 'black', fontsize = 10)
plt.title('Pickup Clusters of New York',fontsize=20)
plt.xlim(-74.02,-73.85)
plt.ylim(40.65,40.85)
plt.show()
# clustering dropoff lon & lat
# https://www.kaggle.com/drgilermo/dynamics-of-new-york-city-animation
# https://matplotlib.org/users/annotations.html
fig,ax = plt.subplots(figsize = (10,8))
for label in df_train_.dropoff_cluster.unique():
    ax.plot(df_train_.dropoff_longitude[df_train_.dropoff_cluster == label],df_train_.dropoff_latitude[df_train_.dropoff_cluster == label],'.', alpha = 0.9, markersize = 0.5)
    ax.plot(kmeans.cluster_centers_[label,1],kmeans.cluster_centers_[label,0],'o', color = 'r')
    ax.annotate(label, (kmeans.cluster_centers_[label,1],kmeans.cluster_centers_[label,0]), color = 'black', fontsize = 10)
plt.title('Dropoff Clusters of New York',fontsize=20)
plt.xlim(-74.02,-73.85)
plt.ylim(40.65,40.85)
plt.show()
df_train_.columns
# avg speed on cluster 
def avg_cluser_speed(df):
    df_ = df.copy()
    # avg speed on cluster 
    avg_cluser_h = df_.groupby(['pickup_cluster','dropoff_cluster']).mean()['avg_speed_h'].reset_index()
    avg_cluser_h.columns = ['pickup_cluster','dropoff_cluster','avg_speed_cluster_h']
    avg_cluser_m = df_.groupby(['pickup_cluster','dropoff_cluster']).mean()['avg_speed_m'].reset_index()
    avg_cluser_m.columns = ['pickup_cluster','dropoff_cluster','avg_speed_cluster_m']
    # merge dataframe
    df_ = pd.merge(df_,avg_cluser_h, how = 'left', on = ['pickup_cluster','dropoff_cluster'])
    df_ = pd.merge(df_,avg_cluser_m, how = 'left', on = ['pickup_cluster','dropoff_cluster'])
    return df_
# avg duration on cluster 
def avg_cluser_duration(df):
    df_ = df.copy()
    # avg duration on (pickup_cluster, dropoff_cluster) pairs
    avg_cluser_duration = df_.groupby(['pickup_cluster','dropoff_cluster']).mean()['trip_duration'].reset_index()
    avg_cluser_duration.columns = ['pickup_cluster','dropoff_cluster','avg_cluster_duration']
    # merge dataframe
    df_ = pd.merge(df_,avg_cluser_duration, how = 'left', on = ['pickup_cluster','dropoff_cluster'])
    return df_
df_train_  = avg_cluser_speed(df_train_)
df_train_  = avg_cluser_duration(df_train_)
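For the test set, the same cluster-level aggregates would have to come from the training data, since the test set has no trip_duration. A hedged sketch, reusing the hypothetical df_test_ frame from the clustering sketch above:
# merge train-derived cluster aggregates onto the test set (sketch)
# the aggregates are constant per (pickup_cluster, dropoff_cluster) pair, so a groupby-mean recovers them
cluster_stats = (df_train_
                 .groupby(['pickup_cluster', 'dropoff_cluster'])[['avg_speed_cluster_h', 'avg_cluster_duration']]
                 .mean()
                 .reset_index())
df_test_ = pd.merge(df_test_, cluster_stats, how='left', on=['pickup_cluster', 'dropoff_cluster'])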
# plot avg speed as heat map 
# https://stackoverflow.com/questions/12857925/how-to-convert-data-values-into-color-information-for-matplotlib
hot = plt.get_cmap('Purples')
fig,ax = plt.subplots(figsize = (10,8))
for label in df_train_.pickup_cluster.unique():
    mean_speed = df_train_[df_train_.pickup_cluster == label]['avg_speed_cluster_h'].mean()
    # skip clusters with no speed data (NaN check on the float, not the string 'nan')
    if np.isnan(mean_speed):
        continue
    avg_cluster_speed = round(mean_speed)
    ax.plot(df_train_.pickup_longitude[df_train_.pickup_cluster == label],df_train_.pickup_latitude[df_train_.pickup_cluster == label],'.'\
    ,c=hot(avg_cluster_speed/10), alpha = 0.2, markersize = .3)
    ax.annotate(avg_cluster_speed, (kmeans.cluster_centers_[label,1],kmeans.cluster_centers_[label,0]), color = 'red', fontsize = 13)
plt.title('Avg Cluster Speed (Haversine)',fontsize=20)
plt.xlim(-74.02,-73.85)
plt.ylim(40.65,40.85)
plt.show()
pd.DataFrame(df_train_.groupby('pickup_cluster').mean()['avg_speed_cluster_h']).transpose()             
    
# plot avg speed as heat map 
# https://stackoverflow.com/questions/12857925/how-to-convert-data-values-into-color-information-for-matplotlib
hot = plt.get_cmap('Purples')
fig,ax = plt.subplots(figsize = (10,8))
# normalize colors by the maximum cluster duration so the colormap spans the duration range
max_duration = df_train_['avg_cluster_duration'].max()
for label in df_train_.pickup_cluster.unique():
    mean_duration = df_train_[df_train_.pickup_cluster == label]['avg_cluster_duration'].mean()
    # skip clusters with no duration data
    if np.isnan(mean_duration):
        continue
    avg_cluster_dur = round(mean_duration)
    ax.plot(df_train_.pickup_longitude[df_train_.pickup_cluster == label],df_train_.pickup_latitude[df_train_.pickup_cluster == label],'.'\
    ,c=hot(avg_cluster_dur/max_duration), alpha = 0.2, markersize = .3)
    ax.annotate(avg_cluster_dur, (kmeans.cluster_centers_[label,1],kmeans.cluster_centers_[label,0]), color = 'red', fontsize = 13)
plt.title('Avg Cluster Duration ',fontsize=20)
plt.xlim(-74.02,-73.85)
plt.ylim(40.65,40.85)
plt.show()
pd.DataFrame(df_train_.groupby('pickup_cluster').mean()['avg_cluster_duration']).transpose()             
    
plt.scatter(df_train_['avg_speed_cluster_h'],df_train_['avg_cluster_duration'],c='black')
plt.xlim(0,100)
plt.ylim(0,5000)
plt.xlabel('avg cluster speed (km/hr)')
plt.ylabel('avg cluster duration (sec)')
plt.title('Avg Cluster Speed VS. Avg Cluster Duration', fontsize=17)
plt.show()
                                                       
The out-of-downtown and northern areas seem to have a higher average cluster speed (16-21 km/hr), while the midtown and downtown areas are mostly under 15 km/hr:
- Maybe because they are far from the city center, drivers can drive faster
 - Longer distances may push drivers to drive in a bit more of a rush
 
Average cluster duration is longer in the financial area (Wall St.) and the out-of-downtown areas, maybe because:
- Traffic is more congested in those areas
 - Drivers tend to drive faster on longer-distance trips (same as above)
 
There is a possible positive correlation between average cluster speed and average cluster duration:
- Maybe drivers have their own internal clock and keep the driving time within a reasonable interval, i.e., neither too long nor too short.