Process :

- 0 LOAD DATA 
- 1 SIMPLE FEATURE EXTRACT
- 2 DATA OVERVIEW
    -2.1 vendor_id
    -2.2 pickup_day & dropoff_day
    -2.3 passenger_count
    -2.4 pickup & dropoff locations (lon & lat )
    -2.5 store_and_fwd_flag
    -2.6 trip_duration
- 3 FEATURE ENGINEERING 
- 4 FEATURE ANALYSIS 
    -4.1 Duration VS. Distance
    -4.2 Driving Direction
    -4.3 Clustering 
    -4.4 Avg speed, orders on clustering 
    -4.5 Cyclic timestamp (week, hour, weekhour)
- 5 DATA CLEANING ANALYSIS

File descriptions

train.csv - the training set (contains 1458644 trip records)

test.csv - the testing set (contains 625134 trip records)

sample_submission.csv - a sample submission file in the correct format

Data fields

id - a unique identifier for each trip

vendor_id - a code indicating the provider associated with the trip record

pickup_datetime - date and time when the meter was engaged

dropoff_datetime - date and time when the meter was disengaged

passenger_count - the number of passengers in the vehicle (driver entered value)

pickup_longitude - the longitude where the meter was engaged

pickup_latitude - the latitude where the meter was engaged

dropoff_longitude - the longitude where the meter was disengaged

dropoff_latitude - the latitude where the meter was disengaged

store_and_fwd_flag - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip

trip_duration - duration of the trip in seconds

import pandas as pd, numpy as np
%matplotlib inline
%pylab inline
import seaborn  as sns 
import matplotlib.pyplot as plt

Populating the interactive namespace from numpy and matplotlib

- 0) Load data

# load data
df_train = pd.read_csv('~/NYC_Taxi_Trip_Duration/data/train.csv')
df_test = pd.read_csv('~/NYC_Taxi_Trip_Duration/data/test.csv')
sampleSubmission = pd.read_csv('~/NYC_Taxi_Trip_Duration/data/sample_submission.csv')

df_train.head(2)

-1) Simple feature extract

# help function
def basic_feature_extract(df):
    df_= df.copy()
    # pickup
    df_["pickup_date"] = pd.to_datetime(df_.pickup_datetime.apply(lambda x : x.split(" ")[0]))
    df_["pickup_hour"] = df_.pickup_datetime.apply(lambda x : x.split(" ")[1].split(":")[0])
    df_["pickup_year"] = df_.pickup_datetime.apply(lambda x : x.split(" ")[0].split("-")[0])
    df_["pickup_month"] = df_.pickup_datetime.apply(lambda x : x.split(" ")[0].split("-")[1])
    df_["pickup_weekday"] = df_.pickup_datetime.apply(lambda x :pd.to_datetime(x.split(" ")[0]).weekday())
    # dropoff
    # in case test data dont have dropoff_datetime feature
    try:
        df_["dropoff_date"] = pd.to_datetime(df_.dropoff_datetime.apply(lambda x : x.split(" ")[0]))
        df_["dropoff_hour"] = df_.dropoff_datetime.apply(lambda x : x.split(" ")[1].split(":")[0])
        df_["dropoff_year"] = df_.dropoff_datetime.apply(lambda x : x.split(" ")[0].split("-")[0])
        df_["dropoff_month"] = df_.dropoff_datetime.apply(lambda x : x.split(" ")[0].split("-")[1])
        df_["dropoff_weekday"] = df_.dropoff_datetime.apply(lambda x :pd.to_datetime(x.split(" ")[0]).weekday())
    except:
        pass 
    return df_

# get weekday
import calendar

def get_weekday(df):
    list(calendar.day_name)
    df_=df.copy()
    df_['pickup_week_'] = pd.to_datetime(df_.pickup_datetime,coerce=True).dt.weekday
    df_['pickup_weekday_'] = df_['pickup_week_'].apply(lambda x: calendar.day_name[x])
    return df_

# get trip duration 
def get_duration(df):
    df_= df.copy()
    df_['trip_duration_cal'] = pd.to_datetime(df_['dropoff_datetime']) - pd.to_datetime(df_['pickup_datetime'])
    return df_

# one may take few minutes 
df_train_ = basic_feature_extract(df_train)

df_train_ = get_duration(df_train_)
df_train_ = get_weekday(df_train_)

//anaconda/envs/g_dash/lib/python3.4/site-packages/ipykernel_launcher.py:30: FutureWarning: the coerce=True keyword is deprecated, use errors='coerce' instead

- 2) Data overview

df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 11 columns):
id                    1458644 non-null object
vendor_id             1458644 non-null int64
pickup_datetime       1458644 non-null object
dropoff_datetime      1458644 non-null object
passenger_count       1458644 non-null int64
pickup_longitude      1458644 non-null float64
pickup_latitude       1458644 non-null float64
dropoff_longitude     1458644 non-null float64
dropoff_latitude      1458644 non-null float64
store_and_fwd_flag    1458644 non-null object
trip_duration         1458644 non-null int64
dtypes: float64(4), int64(3), object(4)
memory usage: 122.4+ MB

train data set has 1458644 trip records (every ride per record), no missing values

df_train_.head(1)

2-1) vendor_id

df_train.vendor_id.value_counts()

2    780302
1    678342
Name: vendor_id, dtype: int64

Vendor_id means id of vendors, here we can see there are 2 vendors serve in this NYC taxi record, they have similar ride counts

2-2) pickup_day & dropoff_day

fig, ax = plt.subplots(ncols=2, sharey=True)
fig.set_size_inches(12, 5)

ax[0].plot(df_train_.groupby('pickup_date').count()['id'], 'go-', alpha=0.5)
ax[1].plot(df_train_.groupby('dropoff_date').count()['id'], 'bo-', alpha=0.5)

ax[0].set(xlabel='date', ylabel='Count',title="pickup counts")
ax[1].set(xlabel='date', ylabel='Count',title="dropoff counts")

plt.show()

The orders countssudden fail in 2016-01-23, need to check deeper to see if this affect data quality
Averagely 7000 - 9000 counts of orders (pickup) per day.

- 2-3) passenger_count

df_train_.groupby(['pickup_date']).sum()['passenger_count'].plot()
df_train_.groupby(['dropoff_date']).sum()['passenger_count'].plot()

<matplotlib.axes._subplots.AxesSubplot at 0x1783c78d0>

df_train_.passenger_count.value_counts(sort=False)

0         60
1    1033540
2     210318
3      59896
4      28404
5      78088
6      48333
7          3
8          1
9          1
Name: passenger_count, dtype: int64

Averagely 12k - 16k passengers per day
Most taxi take 1 passengers per ride, but some numbers like 0,7,8,9 maybe are outliers

2-4) pickup & dropoff locations (lon & lat )

# https://www.kaggle.com/misfyre/in-depth-nyc-taxi-eda-also-w-animation
# folloing ref above, here I drop outliers and do geo data visualization 
# (drop data point > 95% and < 5%)

# pickup 
sns.lmplot(x="pickup_longitude", y="pickup_latitude", fit_reg=False, 
           size=9, scatter_kws={'alpha':0.3,'s':5}, data=df_train_[(
                 df_train_.pickup_longitude>df_train_.pickup_longitude.quantile(0.005))
               &(df_train_.pickup_longitude<df_train_.pickup_longitude.quantile(0.995))
               &(df_train_.pickup_latitude>df_train_.pickup_latitude.quantile(0.005))                           
               &(df_train_.pickup_latitude<df_train_.pickup_latitude.quantile(0.995))])

plt.xlabel('Pickup Longitude');
plt.ylabel('Pickup Latitude');
plt.show()

# dropoff 
sns.lmplot(x="dropoff_longitude", y="dropoff_latitude",fit_reg=False, 
           size=9, scatter_kws={'alpha':0.3,'s':5}, data=df_train_[(
                 df_train_.dropoff_longitude>df_train_.dropoff_longitude.quantile(0.005))
               &(df_train_.dropoff_longitude<df_train_.dropoff_longitude.quantile(0.995))
               &(df_train_.dropoff_latitude>df_train_.dropoff_latitude.quantile(0.005))                           
               &(df_train_.dropoff_latitude<df_train_.dropoff_latitude.quantile(0.995))])

plt.xlabel('dropoff Longitude');
plt.ylabel('dropoff Latitude');
plt.show()

We get an very interesting outcome from this visualization :

Manhattan get MOST of pickup orders, since many people work there, it makes sense it's the place where main demands from
Many dropoff orders outside the Manhattan area maybe because people work/srudy in Manhattan, but live in areas like Queens/Brooklyn..
JFK Airport (the point in east north) maybe be a key point affect duration, since many orders back and forth within the city and JFK, it's a not short distance, may affect duration prediction alot.

2-5) store_and_fwd_flag

df_train_.store_and_fwd_flag.value_counts()

N    1450599
Y       8045
Name: store_and_fwd_flag, dtype: int64

About 0.5% (8045/1450599) of orders (store_and_fwd_flag=Y) are not sent immediately to vendor server, but hold in the memory of taxi, need to investigate if this affact data quality as well

2-6) trip_duration

df_train_.head(1)

# trip duration overall distribution 
# remove outlier by only taking data under .97 quantile

tripduration = df_train_[df_train_.trip_duration < df_train_.trip_duration.quantile(.97)]
tripduration.groupby('trip_duration').count()['id'].plot()

plt.xlabel('trip duration (sec)')
plt.ylabel('trips counts')
plt.title('Duration Distribution')

<matplotlib.text.Text at 0x12b918588>

# pivot table
# http://pbpython.com/pandas-pivot-table-explained.html

#tripduration = df_train_[df_train_.trip_duration < df_train_.trip_duration.quantile(.97)]


pd.pivot_table(tripduration, index='pickup_hour' ,aggfunc=np.mean)['trip_duration'].plot(label='mean')
pd.pivot_table(tripduration, index='pickup_hour' ,aggfunc=np.median)['trip_duration'].plot(label='median')
pd.pivot_table(tripduration, index='pickup_hour' ,aggfunc=np.std)['trip_duration'].plot(label='std')

plt.legend(loc=0)
plt.xlabel('Pick up Hour (0-23)')
plt.ylabel('orders counts ')
plt.title('Orders VS Pickup hour')
plt.show()

# plots trip duration on store_and_fwd_flag 
# (send back data to vendor server directly or save in taxi then upload because internet issues)

plt.figure(figsize=(14,6))
sns.barplot(x='pickup_hour',y='trip_duration',data=df_train_,hue='store_and_fwd_flag')
plt.xlabel('pickup_hour',fontsize=16)
plt.ylabel('mean(trip_duration)',fontsize=16)

<matplotlib.text.Text at 0x10bdb8710>

# duration VS. pickup hour in given months 

plt.figure(figsize=(14,6))
sns.pointplot(x='pickup_hour',y='trip_duration',data=tripduration,hue='pickup_month')
plt.xlabel('pickup_hour',fontsize=16)
plt.ylabel('mean(trip_duration)',fontsize=16)

<matplotlib.text.Text at 0x12b372710>

# duration VS. pickup hour in weekdays 


plt.figure(figsize=(14,6))
sns.pointplot(x='pickup_hour',y='trip_duration',data=tripduration,hue='pickup_weekday_',hue_order=list(calendar.day_name))
plt.xlabel('pickup_hour',fontsize=16)
plt.ylabel('mean(trip_duration)',fontsize=16)

<matplotlib.text.Text at 0x10c087470>

df_train_.groupby('trip_duration').count()['id'].tail(5)

# 86392 sec means 23 hours 
# 2049578 sec means 23 days 
# which are all no make sense

trip_duration
86392      1
1939736    1
2049578    1
2227612    1
3526282    1
Name: id, dtype: int64

Duration is the prediction target in this competition :

Most trips finished within 6-10 minute (400-600 sec)
Some trips take too long/too fast are obvious wrong, maybe because of tech/maunal operation issues, should filter them in following process
Duration are longer when store_and_fwd_flag = Y, maybe because internet is bad in areas off downtowm, imply taxi drivers much longer relatively.
Duration RISE dramtically from 7 AM to 10 AM, maybe owing to traffic jam and people start moving to NYC from nearyby areas when daytime
Months only affect duration a little bit ; while pickup hours seems may be infulence that more
Weekday is apparently relative to duration. Duration are high duing wokday (Mon.-Fri.), since people work, taxi are much more busy to take people from Manhattan <---> outside Manhattan

3) Feature engineering

# feature extract

###  distance 
# https://www.kaggle.com/gaborfodor/from-eda-to-the-top-lb-0-377?scriptVersionId=1369021

# Haversine distance
def get_haversine_distance(lat1, lng1, lat2, lng2):
    # km
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    AVG_EARTH_RADIUS = 6371  #  km
    lat = lat2 - lat1
    lng = lng2 - lng1
    d = np.sin(lat * 0.5) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(lng * 0.5) ** 2
    h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
    return h 

# Manhattan distance
# Taxi cant fly ! have to move in blocks/roads
def get_manhattan_distance(lat1, lng1, lat2, lng2):
    # km 
    a = get_haversine_distance(lat1, lng1, lat1, lng2)
    b = get_haversine_distance(lat1, lng1, lat2, lng1)
    return a + b


# get direction (arc tangent angle)
def get_direction(lat1, lng1, lat2, lng2):
    # theta
    AVG_EARTH_RADIUS = 6371  #  km
    lng_delta_rad = np.radians(lng2 - lng1)
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    y = np.sin(lng_delta_rad) * np.cos(lat2)
    x = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(lng_delta_rad)
    return np.degrees(np.arctan2(y, x))



### ======================== ###

def get_features(df):
    # km 
    df_ = df.copy()
    ###  USING .loc making return array ordering 
    # distance
    df_.loc[:, 'distance_haversine'] = get_haversine_distance(
                      df_['pickup_latitude'].values,
                      df_['pickup_longitude'].values,
                      df_['dropoff_latitude'].values,
                      df_['dropoff_longitude'].values)
    df_.loc[:, 'distance_manhattan'] = get_manhattan_distance(
                      df_['pickup_latitude'].values,
                      df_['pickup_longitude'].values,
                      df_['dropoff_latitude'].values,
                      df_['dropoff_longitude'].values)
    # direction 
    df_.loc[:, 'direction'] = get_direction(df_['pickup_latitude'].values,
                                          df_['pickup_longitude'].values, 
                                          df_['dropoff_latitude'].values, 
                                          df_['dropoff_longitude'].values)
    # Get Average driving speed 
    # km/hr
    # (km/sec = 3600 * (km/hr))
    df_.loc[:, 'avg_speed_h'] = 3600 * df_['distance_haversine'] / df_['trip_duration']
    df_.loc[:, 'avg_speed_m'] = 3600 * df_['distance_manhattan'] / df_['trip_duration']
    
    return df_

# get speed (taxi velocity)
#df_train_.loc[:, 'avg_speed_h'] = 1000 * df_train_['distance_haversine'] / df_train_['trip_duration']
#df_train_.loc[:, 'avg_speed_m'] = 1000 * df_train_['distance_manhattan'] / df_train_['trip_duration']

df_train_ = get_features(df_train_)

df_train_.head(1)

4) Feature analysis

4-1) Duration VS Distance

# distance VS duration 
sns.jointplot((df_train_["distance_manhattan"][:10000]+1),(df_train_["trip_duration"][:10000]+1),s=10,alpha=0.5,color='blue')
plt.xlabel('mile')
plt.ylabel('trip_duration')
plt.xlim(0,30)
plt.ylim(0,5000)
plt.title('Distance VS Duration')
plt.show()

# log distance VS duration 
sns.jointplot(np.log(df_train_["distance_manhattan"][:10000]+1),(df_train_["trip_duration"][:10000]+1),s=10,alpha=0.5,color='purple')
plt.xlim(0,30)
plt.ylim(0,5000)
plt.title('Logarithm Distance VS Duration')
plt.xlabel('log(distance_manhattan)')
plt.ylabel('trip_duration')
plt.show()

There are positive relations within `distance and duration` in the cases :

- (logarithm) haversine distance VS. duration 
- (logarithm) manhattan distance VS. duration

which fit common sense: trip takes longer when distance longer

** log(distance) VS duration seems has `better co-relation`, will add log(distance) as new feature

4-1') haversine VS manhattan

# ref 
# https://www.kaggle.com/hanriver0618/nyc-taxi-data-exploration-visualization

# haversine distance VS duration 
# or u can plot all data points, still a linear relation 
#sns.jointplot(np.log10(df_train_["distance_haversine"][:10000]+1),np.log10(df_train_["distance_haversine"][:10000]+1),s=10,alpha=0.5,color='green')
sns.jointplot((df_train_["distance_haversine"][:10000]+1),(df_train_["distance_haversine"][:10000]+1),s=10,alpha=0.5,color='green')
plt.xlabel('log (mile)')
plt.ylabel('log (trip_duration)')
plt.show()

# manhattan distance VS duration 
# or u can plot all data points, still a linear relation
sns.jointplot((df_train_["distance_manhattan"][:10000]+1),(df_train_["distance_haversine"][:10000]+1),s=10,alpha=0.5,color='black')
plt.xlabel('log (mile)')
plt.ylabel('log (trip_duration)')
plt.show()

Haversine and manhattan (distance) are too `linear co-related`,

maybe only input one of them when modeling, avoiding possbile co-linear effect by too much relation features in same model

4-2) Speed VS Times

# remove potential outliers for better visualization
df_avg_speed = df_train_[(df_train_['avg_speed_h'] < df_train_['avg_speed_h'].quantile(0.99))&
         (df_train_['avg_speed_h'] > df_train_['avg_speed_h'].quantile(0.01))]

sns.distplot(df_avg_speed.avg_speed_h)
plt.xlabel('average haversine speed [km/hr]',fontsize=16)
plt.ylabel('relative amount',fontsize=16)
plt.title('Speed Distribution',fontsize=16)
plt.show()

# Speed in different situations 
plt.figure(figsize=(20,12))
pd.pivot_table(df_train_, index='pickup_hour',columns='pickup_weekday' ,aggfunc=np.mean)['avg_speed_h'].plot()
plt.xlabel('Pick up hour',fontsize=16)
plt.ylabel('Speed [km/hr] ',fontsize=16)
plt.title('Driving Speed VS. Pickup Hour VS. Pickup Weekday',fontsize=16)

<matplotlib.text.Text at 0x12b5491d0>

<matplotlib.figure.Figure at 0x13759a3c8>

# haversine speed visualization 

fig, ax = plt.subplots(nrows=3, sharey=True)
fig.set_size_inches(12, 10)
ax[0].plot(df_train_.groupby('pickup_hour').mean()['avg_speed_h'], 'bo-', lw=2, alpha=0.7)
ax[1].plot(df_train_.groupby('pickup_weekday').mean()['avg_speed_h'], 'go-', lw=2, alpha=0.7)
ax[2].plot(df_train_.groupby('pickup_date').mean()['avg_speed_h'], 'ro-', lw=2, alpha=0.7)
ax[0].set_xlabel('hour',fontsize=16)
ax[1].set_xlabel('weekday',fontsize=16)
ax[2].set_xlabel('date',fontsize=16)
ax[0].set_ylabel('average speed',fontsize=16)
ax[1].set_ylabel('average speed',fontsize=16)
ax[2].set_ylabel('average speed',fontsize=16)
fig.suptitle('Average Taxi Speed [km/hr] VS Pickup Times',fontsize=16)
plt.show()

`Speed` relate to `Pickup Hour` strongly

- 1. When in `non-rush hours`, taxi can drive faster
- 2. Speed peak is within 5 am in weekday ; within 7 am in weekend
- 3. Speed fluctuate periodically by days 
- 4. Average speed of around 15 km/h looks reasonable for NYC.

4-3) Driving Direction

# pickup 
sns.lmplot(x="pickup_longitude", y="pickup_latitude", fit_reg=False, 
           size=9, scatter_kws={'alpha':0.05,'s':5}, data=df_train_[(
                 df_train_.pickup_longitude>df_train_.pickup_longitude.quantile(0.005))
               &(df_train_.pickup_longitude<df_train_.pickup_longitude.quantile(0.995))
               &(df_train_.pickup_latitude>df_train_.pickup_latitude.quantile(0.005))                           
               &(df_train_.pickup_latitude<df_train_.pickup_latitude.quantile(0.995))])

JFK_location=[-73.778203,40.641165]
LaGuardia_location = [-73.873923,40.776935]
NYC_center = [-73.977282,40.770940]

### transform locations to dataframe 
locations=pd.DataFrame({'JFK':JFK_location,
             'LaGuardia':LaGuardia_location,
             'NYC':NYC_center}).T
locations.columns=['lon','lat']
###

plt.plot(JFK_location[0],JFK_location[1],'o', color = 'r',alpha=0.9,markersize=10)
plt.plot(LaGuardia_location[0],LaGuardia_location[1],'o', color = 'r',alpha=0.9,markersize=10)
plt.plot(NYC_center[0],NYC_center[1],'o', color = 'r',alpha=0.9,markersize=10)
plt.annotate('NYC_center', (NYC_center[0], NYC_center[1]), color = 'black', fontsize = 15)
plt.annotate('LaGuardia', (LaGuardia_location[0], LaGuardia_location[1]), color = 'black', fontsize = 15)
plt.annotate('JFK', (JFK_location[0], JFK_location[1]), color = 'black', fontsize = 15)


plt.xlabel('Pickup Longitude',fontsize=16);
plt.ylabel('Pickup Latitude',fontsize=16);
plt.title('JFK <--> NYC, LAG <---> NYC',fontsize=16)
plt.show()

# GET angles between main aiport and NYC downtown
JFK_NYC= get_direction(JFK_location[0],
                       JFK_location[1],
                       NYC_center[0],
                       NYC_center[1])

LAG_NYC= get_direction(LaGuardia_location[0],
                       LaGuardia_location[1],
                       NYC_center[0],
                       NYC_center[1])

print ('JFK - NYC_center angle = ',JFK_NYC )
print ('LAG - NYC_center angle = ',LAG_NYC )

JFK - NYC_center angle =  169.801849301
LAG - NYC_center angle =  -179.082799119

sns.distplot(df_train_.direction,color='black')
plt.xlabel('driving direction',fontsize=16)
plt.ylabel('relative amount',fontsize=16)
plt.title('Direction Distribution',fontsize=16)
plt.show()

There are 2 peaks around `30 and -150 degrees` in direction distribution

Maybe
- The 30 ones represent orders from JFK -> Manhattan ;
- The 150 ones represent orders form LAF -> Manhattan ;
JFK - NYC_center angle = 169.801849301
LAG - NYC_center angle = -179.082799119

-4.3) Clustering

from sklearn.cluster import MiniBatchKMeans
# get lon & lat clustering for following avg location speed calculation
def get_clustering(df):
    coords = np.vstack((df_train[['pickup_latitude', 'pickup_longitude']].values,
                    df_train[['dropoff_latitude', 'dropoff_longitude']].values,
                    df_test[['pickup_latitude', 'pickup_longitude']].values,
                    df_test[['dropoff_latitude', 'dropoff_longitude']].values))
    df_ = df.copy()
    sample_ind = np.random.permutation(len(coords))[:500000]
    kmeans = MiniBatchKMeans(n_clusters=40, batch_size=10000).fit(coords[sample_ind])
    df_.loc[:, 'pickup_cluster'] = kmeans.predict(df_[['pickup_latitude', 'pickup_longitude']])
    df_.loc[:, 'dropoff_cluster'] = kmeans.predict(df_[['dropoff_latitude', 'dropoff_longitude']])
    # a little bit modify clustering function here 
    return df_,kmeans

df_train_,kmeans = get_clustering(df_train_)

# clustering pickup lon & lat
# https://www.kaggle.com/drgilermo/dynamics-of-new-york-city-animation
# https://matplotlib.org/users/annotations.html

fig,ax = plt.subplots(figsize = (10,8))
for label in df_train_.pickup_cluster.unique():
    ax.plot(df_train_.pickup_longitude[df_train_.pickup_cluster == label],df_train_.pickup_latitude[df_train_.pickup_cluster == label],'.', alpha = 0.9, markersize = 0.5)
    ax.plot(kmeans.cluster_centers_[label,1],kmeans.cluster_centers_[label,0],'o', color = 'r')
    ax.annotate(label, (kmeans.cluster_centers_[label,1],kmeans.cluster_centers_[label,0]), color = 'black', fontsize = 10)

plt.title('Pickup Clusters of New York',fontsize=20)
plt.xlim(-74.02,-73.85)
plt.ylim(40.65,40.85)
plt.show()

# clustering pickup lon & lat
# https://www.kaggle.com/drgilermo/dynamics-of-new-york-city-animation
# https://matplotlib.org/users/annotations.html

fig,ax = plt.subplots(figsize = (10,8))
for label in df_train_.dropoff_cluster.unique():
    ax.plot(df_train_.dropoff_longitude[df_train_.dropoff_cluster == label],df_train_.dropoff_latitude[df_train_.dropoff_cluster == label],'.', alpha = 0.9, markersize = 0.5)
    ax.plot(kmeans.cluster_centers_[label,1],kmeans.cluster_centers_[label,0],'o', color = 'r')
    ax.annotate(label, (kmeans.cluster_centers_[label,1],kmeans.cluster_centers_[label,0]), color = 'black', fontsize = 10)

plt.title('Dropoff Clusters of New York',fontsize=20)
plt.xlim(-74.02,-73.85)
plt.ylim(40.65,40.85)
plt.show()

4.4) Avg speed, orders on clustering

df_train_.columns

Index(['id', 'vendor_id', 'pickup_datetime', 'dropoff_datetime',
       'passenger_count', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'store_and_fwd_flag',
       'trip_duration', 'pickup_date', 'pickup_hour', 'pickup_year',
       'pickup_month', 'pickup_weekday', 'dropoff_date', 'dropoff_hour',
       'dropoff_year', 'dropoff_month', 'dropoff_weekday', 'trip_duration_cal',
       'pickup_week_', 'pickup_weekday_', 'distance_haversine',
       'distance_manhattan', 'direction', 'avg_speed_h', 'avg_speed_m',
       'pickup_cluster', 'dropoff_cluster'],
      dtype='object')

# avg speed on cluster 
def avg_cluser_speed(df):
    df_ = df.copy()
    # avg speed on cluster 
    avg_cluser_h = df_.groupby(['pickup_cluster','dropoff_cluster']).mean()['avg_speed_h'].reset_index()
    avg_cluser_h.columns = ['pickup_cluster','dropoff_cluster','avg_speed_cluster_h']
    avg_cluser_m = df_.groupby(['pickup_cluster','dropoff_cluster']).mean()['avg_speed_m'].reset_index()
    avg_cluser_m.columns = ['pickup_cluster','dropoff_cluster','avg_speed_cluster_m']
    # merge dataframe
    df_ = pd.merge(df_,avg_cluser_h, how = 'left', on = ['pickup_cluster','dropoff_cluster'])
    df_ = pd.merge(df_,avg_cluser_m, how = 'left', on = ['pickup_cluster','dropoff_cluster'])
    return df_

# avg duration on cluster 
def avg_cluser_duration(df):
    df_ = df.copy()
    # avg speed on cluster 
    avg_cluser_duration = df_.groupby(['pickup_cluster','dropoff_cluster']).mean()['trip_duration'].reset_index()
    avg_cluser_duration.columns = ['pickup_cluster','dropoff_cluster','avg_cluster_duration']
    # merge dataframe
    df_ = pd.merge(df_,avg_cluser_duration, how = 'left', on = ['pickup_cluster','dropoff_cluster'])
    return df_

df_train_  = avg_cluser_speed(df_train_)
df_train_  = avg_cluser_duration(df_train_)

# plot avg speed as heat map 


# https://stackoverflow.com/questions/12857925/how-to-convert-data-values-into-color-information-for-matplotlib
hot = plt.get_cmap('Purples')

fig,ax = plt.subplots(figsize = (10,8))
for label in df_train_.dropoff_cluster.unique():
    avg_cluster_speed= round(df_train_[df_train_.pickup_cluster ==label]['avg_speed_cluster_h'].mean())

    ax.plot(df_train_.pickup_longitude[df_train_.pickup_cluster == label],df_train_.pickup_latitude[df_train_.pickup_cluster == label],'.'\
    ,c=hot(avg_cluster_speed/10), alpha = 0.2, markersize = .3)
    if avg_cluster_speed == 'nan':
        pass
    else:
        ax.annotate(avg_cluster_speed, (kmeans.cluster_centers_[label,1],kmeans.cluster_centers_[label,0]), color = 'red', fontsize = 13)

plt.title('Avg Cluster Speed (Haversine)',fontsize=20)
plt.xlim(-74.02,-73.85)
plt.ylim(40.65,40.85)
plt.show()


pd.DataFrame(df_train_.groupby('pickup_cluster').mean()['avg_speed_cluster_h']).transpose()

# plot avg speed as heat map 


# https://stackoverflow.com/questions/12857925/how-to-convert-data-values-into-color-information-for-matplotlib
hot = plt.get_cmap('Purples')

fig,ax = plt.subplots(figsize = (10,8))
for label in df_train_.dropoff_cluster.unique():
    avg_cluster_speed= round(df_train_[df_train_.pickup_cluster ==label]['avg_cluster_duration'].mean())

    ax.plot(df_train_.pickup_longitude[df_train_.pickup_cluster == label],df_train_.pickup_latitude[df_train_.pickup_cluster == label],'.'\
    ,c=hot(avg_cluster_speed/10), alpha = 0.2, markersize = .3)
    if avg_cluster_speed == 'nan':
        pass
    else:
        ax.annotate(avg_cluster_speed, (kmeans.cluster_centers_[label,1],kmeans.cluster_centers_[label,0]), color = 'red', fontsize = 13)

plt.title('Avg Cluster Duration ',fontsize=20)
plt.xlim(-74.02,-73.85)
plt.ylim(40.65,40.85)
plt.show()


pd.DataFrame(df_train_.groupby('pickup_cluster').mean()['avg_cluster_duration']).transpose()

plt.scatter(df_train_['avg_speed_cluster_h'],df_train_['avg_cluster_duration'],c='black')
plt.xlim(0,100)
plt.ylim(0,5000)
plt.title('Avg Cluster Speed VS. Avg Cluster Duration', fontsize=17)
plt.show()

Out-of-downtown, and north area seems has `higher` avg cluster speed`(16-21 km/hr)`, while middle and downtown area `ubder 15 (km/hr)` mostly

Maybe because they are far away city center, driver can drive fast
Longer distance make driver drive a little bit rush

Avg cluster duration are longer in financial area(wall st.) and out-of-down area, maybe because

traffie is more crowded in such area
Drivers tend to drive faster in case if longer distance (same as above)

There is a possible `positive co-relation` within Avg cluster speed and duration

Maybe driver has their own clock, can control driving time in a reasonable time interval , i.e. making driving time not too long and too short.

	id	vendor_id	pickup_datetime	dropoff_datetime	passenger_count	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag	trip_duration
0	id2875421	2	2016-03-14 17:24:55	2016-03-14 17:32:30	1	-73.982155	40.767937	-73.964630	40.765602	N	455
1	id2377394	1	2016-06-12 00:43:35	2016-06-12 00:54:38	1	-73.980415	40.738564	-73.999481	40.731152	N	663