Chapter 3 Data transformation
We typecasted the date and time columns to the required format. When plotting Borough, Zip Code and Crash time summary statistics, we dropped all instances where the values were not available and grouped by the appropriate variable. Depending on the graph requirements, we took subsets of the data using filtering. When plotting the Choropleth maps, we obtained the geojson data corresponding to NYC Boroughs and Zip Codes, and merged it with our transformed data.
The below dataframe represents our datatset with the date and time in appropriate formats and without NA values for the same.
## # A tibble: 6 × 30
## CRASH.DATE CRASH.TIME BOROUGH ZIP.CODE LATITUDE LONGITUDE LOCATION ON.STREET.NAME CROSS.STREET.NA… OFF.STREET.NAME NUMBER.OF.PERSO…
## <chr> <time> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 04/14/2021 00'00" BROOKLYN 11203 40.6 -73.9 (40.64664, -73.9246) KINGS HIGHWAY BEVERLEY ROAD <NA> 2
## 2 04/14/2021 00'00" <NA> NA 40.6 -74.0 (40.601864, -74.00232) BATH AVENUE <NA> <NA> 1
## 3 04/14/2021 00'00" BROOKLYN 11235 40.6 -74.0 (40.587894, -73.95504) <NA> <NA> 1415 AVEN… 0
## 4 04/14/2021 00'00" BRONX 10460 40.8 -73.9 (40.835827, -73.89068) <NA> <NA> 1600 BOST… 0
## 5 04/15/2021 00'00" QUEENS 11377 40.7 -73.9 (40.7357, -73.90377) 51 AVENUE 61 STREET <NA> 0
## 6 04/15/2021 00'00" <NA> NA 40.8 -73.9 (40.780437, -73.94989) EAST 90 STREET <NA> <NA> 0
## # … with 19 more variables: NUMBER.OF.PERSONS.KILLED <dbl>, NUMBER.OF.PEDESTRIANS.INJURED <dbl>, NUMBER.OF.PEDESTRIANS.KILLED <dbl>,
## # NUMBER.OF.CYCLIST.INJURED <dbl>, NUMBER.OF.CYCLIST.KILLED <dbl>, NUMBER.OF.MOTORIST.INJURED <dbl>, NUMBER.OF.MOTORIST.KILLED <dbl>,
## # CONTRIBUTING.FACTOR.VEHICLE.1 <chr>, CONTRIBUTING.FACTOR.VEHICLE.2 <chr>, CONTRIBUTING.FACTOR.VEHICLE.3 <chr>,
## # CONTRIBUTING.FACTOR.VEHICLE.4 <chr>, CONTRIBUTING.FACTOR.VEHICLE.5 <chr>, COLLISION_ID <dbl>, VEHICLE.TYPE.CODE.1 <chr>,
## # VEHICLE.TYPE.CODE.2 <chr>, VEHICLE.TYPE.CODE.3 <chr>, VEHICLE.TYPE.CODE.4 <chr>, VEHICLE.TYPE.CODE.5 <chr>, CRASH.TIME.hr.range <int>
We used the dataframe below to create a bar chart for the percentage of fatal accidents.
## # A tibble: 6 × 2
## # Groups: Hours [6]
## Hours `Percentage of fatal accidents`
## <int> <dbl>
## 1 0 0.204
## 2 1 0.307
## 3 2 0.338
## 4 3 0.468
## 5 4 0.541
## 6 5 0.359
This dataframe was used to get a line chart to observe the months with most accidents and fatal accidents.
## # A tibble: 6 × 3
## # Groups: Months [6]
## Months Accident.Count Fatal.Accident.Count
## <chr> <int> <int>
## 1 January 141170 167
## 2 February 130394 138
## 3 March 144807 138
## 4 April 133905 168
## 5 May 154104 179
## 6 June 156078 207
This dataframe was used to get a line chart to observe the days with most accidents and fatal accidents.
## # A tibble: 6 × 3
## # Groups: Month, Day [6]
## Month Day `Accident Count`
## <chr> <chr> <int>
## 1 01 01 3538
## 2 01 02 4183
## 3 01 03 4308
## 4 01 04 4150
## 5 01 05 4070
## 6 01 06 4563
To compare the worst and best Boroughs in terms of accident count we transformed the data in the following fashion
## # A tibble: 6 × 3
## # Groups: BOROUGH, CRASH.TIME.hr.range [6]
## BOROUGH CRASH.TIME.hr.range n
## <chr> <int> <int>
## 1 BROOKLYN 0 -11590
## 2 BROOKLYN 1 -5928
## 3 BROOKLYN 2 -4511
## 4 BROOKLYN 3 -3772
## 5 BROOKLYN 4 -4210
## 6 BROOKLYN 5 -4581
To observe the contour graph we created the folowing dataframe
## # A tibble: 6 × 4
## # Groups: LATITUDE, LONGITUDE [6]
## LATITUDE LONGITUDE BOROUGH Total
## <dbl> <dbl> <chr> <int>
## 1 40.5 -74.2 STATEN ISLAND 2
## 2 40.5 -74.2 STATEN ISLAND 1
## 3 40.5 -74.2 STATEN ISLAND 1
## 4 40.5 -74.2 STATEN ISLAND 1
## 5 40.5 -74.2 STATEN ISLAND 1
## 6 40.5 -74.2 STATEN ISLAND 1
To see the most improved and deteriorated zip codes we created the dataframe below
## ZIP.CODE mean_first mean_second Difference
## 21 10022 1979.2 988.2 -991.0
## 15 10016 2009.2 1077.0 -932.2
## 18 10019 1956.0 1234.2 -721.8
## 35 10036 1879.6 1186.2 -693.4
## 2 10001 1809.2 1145.8 -663.4
## 88 10454 709.4 813.6 104.2
Since, we find NA values in Number of people killed and/or injured, we replace those values with the individual sum of pedestrians, cyclists, motorists killed and/or injured respectively.
## CRASH.DATE CRASH.TIME BOROUGH ZIP.CODE LATITUDE LONGITUDE LOCATION NUMBER.OF.PERSONS.INJURED NUMBER.OF.PERSONS.KILLED
## 2 04/13/2021 21:35 BROOKLYN 11217 40.68358 -73.97617 (40.68358, -73.97617) 1 0
## 4 04/13/2021 16:00 BROOKLYN 11222 NA NA <NA> 0 0
## 7 04/13/2021 17:30 QUEENS 11106 NA NA <NA> 0 0
## 11 04/11/2021 21:06 BROOKLYN 11226 NA NA <NA> 1 0
## 12 04/15/2021 20:00 STATEN ISLAND 10304 NA NA <NA> 0 0
## 14 05/21/2019 22:50 BROOKLYN 11201 40.69754 -73.98312 (40.69754, -73.98312) 0 0
## NUMBER.OF.PEDESTRIANS.INJURED NUMBER.OF.PEDESTRIANS.KILLED NUMBER.OF.CYCLIST.INJURED NUMBER.OF.CYCLIST.KILLED NUMBER.OF.MOTORIST.INJURED
## 2 1 0 0 0 0
## 4 0 0 0 0 0
## 7 0 0 0 0 0
## 11 1 0 0 0 0
## 12 0 0 0 0 0
## 14 0 0 0 0 0
## NUMBER.OF.MOTORIST.KILLED TOT.KILLED TOT.INJURED
## 2 0 0 1
## 4 0 0 0
## 7 0 0 0
## 11 0 0 1
## 12 0 0 0
## 14 0 0 0
Here, we Groupby Borough and Zip Code to get Number of Persons Killed and Injured in the years 2012-2021.
## # A tibble: 6 × 3
## YEAR BOROUGH NUMBER.OF.PERSONS.KILLED
## <chr> <chr> <int>
## 1 2012 BRONX 17
## 2 2012 BROOKLYN 33
## 3 2012 MANHATTAN 21
## 4 2012 QUEENS 28
## 5 2012 STATEN ISLAND 11
## 6 2013 BRONX 32
Since Brooklyn had the most number of accidents, let us look into what are the top 20 areas (referenced by Zip Codes) in Brooklyn that have the highest number of fatalities (people killed and people injured).
## # A tibble: 6 × 2
## ZIP.CODE NUMBER.OF.PERSONS.TOT.HURT
## <chr> <int>
## 1 11207 8787
## 2 11203 6664
## 3 11236 6581
## 4 11212 5894
## 5 11226 5435
## 6 11208 5185
To visualize the impact caused by accidents in each Borough with thr help of a treemap, we made the following data transformations.
## # A tibble: 6 × 3
## YEAR BOROUGH Total.Hurt
## <chr> <chr> <int>
## 1 2012 BRONX 2895
## 2 2012 BROOKLYN 7559
## 3 2012 MANHATTAN 3891
## 4 2012 QUEENS 5534
## 5 2012 STATEN ISLAND 1256
## 6 2013 BRONX 6291
## # A tibble: 6 × 3
## YEAR BOROUGH Total.Injured
## <chr> <chr> <int>
## 1 2012 BRONX 2878
## 2 2012 BROOKLYN 7526
## 3 2012 MANHATTAN 3870
## 4 2012 QUEENS 5506
## 5 2012 STATEN ISLAND 1245
## 6 2013 BRONX 6259
## # A tibble: 6 × 3
## YEAR BOROUGH Total.Killed
## <chr> <chr> <int>
## 1 2012 BRONX 17
## 2 2012 BROOKLYN 33
## 3 2012 MANHATTAN 21
## 4 2012 QUEENS 28
## 5 2012 STATEN ISLAND 11
## 6 2013 BRONX 32
We make transformations to visualize Contributing Factors to accidents while also editing some spelling mistakes.
## CRASH.DATE CRASH.TIME BOROUGH ZIP.CODE LATITUDE LONGITUDE LOCATION NUMBER.OF.PERSONS.INJURED NUMBER.OF.PERSONS.KILLED
## 2 04/13/2021 21:35 BROOKLYN 11217 40.68358 -73.97617 (40.68358, -73.97617) 1 0
## 4 04/13/2021 16:00 BROOKLYN 11222 NA NA <NA> 0 0
## 7 04/13/2021 17:30 QUEENS 11106 NA NA <NA> 0 0
## 11 04/11/2021 21:06 BROOKLYN 11226 NA NA <NA> 1 0
## 12 04/15/2021 20:00 STATEN ISLAND 10304 NA NA <NA> 0 0
## 14 05/21/2019 22:50 BROOKLYN 11201 40.69754 -73.98312 (40.69754, -73.98312) 0 0
## NUMBER.OF.PEDESTRIANS.INJURED NUMBER.OF.PEDESTRIANS.KILLED NUMBER.OF.CYCLIST.INJURED NUMBER.OF.CYCLIST.KILLED NUMBER.OF.MOTORIST.INJURED
## 2 1 0 0 0 0
## 4 0 0 0 0 0
## 7 0 0 0 0 0
## 11 1 0 0 0 0
## 12 0 0 0 0 0
## 14 0 0 0 0 0
## NUMBER.OF.MOTORIST.KILLED CONTRIBUTING.FACTOR.VEHICLE.1 CONTRIBUTING.FACTOR.VEHICLE.2 CONTRIBUTING.FACTOR.VEHICLE.3
## 2 0 unspecified <NA> <NA>
## 4 0 following too closely unspecified <NA>
## 7 0 driver inattention/distraction unspecified <NA>
## 11 0 passing too closely <NA> <NA>
## 12 0 unspecified <NA> <NA>
## 14 0 passing or lane usage improper unspecified <NA>
## CONTRIBUTING.FACTOR.VEHICLE.4 CONTRIBUTING.FACTOR.VEHICLE.5 COLLISION_ID VEHICLE.TYPE.CODE.1 VEHICLE.TYPE.CODE.2 VEHICLE.TYPE.CODE.3
## 2 <NA> <NA> 4407147 sedan <NA> <NA>
## 4 <NA> <NA> 4407811 sedan <NA> <NA>
## 7 <NA> <NA> 4408019 sedan sedan <NA>
## 11 <NA> <NA> 4406488 taxi <NA> <NA>
## 12 <NA> <NA> 4408310 sedan <NA> <NA>
## 14 <NA> <NA> 4136992 �mbu taxi <NA>
## VEHICLE.TYPE.CODE.4 VEHICLE.TYPE.CODE.5 TOT.KILLED TOT.INJURED
## 2 <NA> <NA> 0 1
## 4 <NA> <NA> 0 0
## 7 <NA> <NA> 0 0
## 11 <NA> <NA> 0 1
## 12 <NA> <NA> 0 0
## 14 <NA> <NA> 0 0
To view the Contributing Factors against the number of people hurt in accidents, we make the following transformation.
## # A tibble: 6 × 2
## CONTRIBUTING.FACTOR.VEHICLE.1 NUMBER.OF.PERSONS.TOT.HURT
## <chr> <int>
## 1 unspecified 117775
## 2 driver inattention/distraction 66164
## 3 failure to yield right-of-way 37378
## 4 traffic control disregarded 13341
## 5 following too closely 12137
## 6 other vehicular 8149
To view the leading contributors in the most accident prone areas, we make the following transformation.
## CONTRIBUTING.FACTOR.VEHICLE.1 X2 X3 X4 X5 X6 X7
## 1 alcohol involvement 299 974 772 1693 1820 5558
## 2 backing unsafely 179 1065 1185 1541 1919 5889
## 3 driver inattention/distraction 2817 9295 12315 20678 20888 65993
## 4 driver inexperience 267 745 793 1355 1725 4885
## 5 failure to yield right-of-way 1523 4762 5165 12902 12837 37189
## 6 fatigued/drowsy 266 744 1173 1520 2086 5789
We make the following transformations to view vehicle types are involved in the maximum number of fatal accidents.
## # A tibble: 6 × 2
## VEHICLE.TYPE.CODE.1 Count
## <chr> <dbl>
## 1 sport utility / station wagon 367.
## 2 passenger vehicle 310.
## 3 sedan 296.
## 4 taxi 57.8
## 5 pick-up truck 27.0
## 6 4 dr sedan 26.9
To see which of these vehicles cause the highest number of fatalities we make the transformations given below.
## # A tibble: 6 × 2
## VEHICLE.TYPE.CODE.2 Count
## <chr> <int>
## 1 sport utility / station wagon 879
## 2 passenger vehicle 800
## 3 sedan 777
## 4 ambulance 192
## 5 unknown 190
## 6 taxi 142