Analysis of Fatal Road Accidents in Australia since 1989

Introduction

This notebook looks at fatal road crashes in Australia from 1989 onward using the Australian Road Deaths Database (ARDD). The focus is simple: who is most affected, where risk is concentrated, and how patterns have shifted over time.

The key objectives of this analysis are to:

  • Identify the demographic groups most affected by fatal road accidents

  • Examine how fatalities are distributed across Australian states and territories

  • Explore patterns in accident timing (days of the week, months of the year)

  • Analyse trends over time in road fatalities overall and among vulnerable road users (pedestrians, cyclists)

  • Use straightforward data cleaning and exploratory visualisation to surface practical trends for policy and prevention work.

Key Findings

  • Road fatalities have steadily declined from 1989 to 2025, both in raw numbers and when adjusted per 100,000 population.

  • Young drivers (aged 17–25) historically had the highest fatality rates but have seen substantial improvements, likely due to licensing restrictions.

  • Males account for approximately 70% of all road fatalities, a ratio that has remained consistent over time.

  • The majority of fatal accidents occur between Friday and Sunday, with Friday afternoon being the most dangerous time period.

  • New South Wales, Victoria, and Queensland account for more than 80% of all fatalities.

  • Pedestrian fatalities have decreased over time, but cyclist fatalities have remained relatively constant since the mid-1990s.

Part 1: Data Checking

1.1 Reading Data and Importing Libraries

Import the necessary libraries and read the dataset into a pandas DataFrame. The dataset is in CSV format and contains information about fatal road accidents in Australia from 1989. The dataset is updated monthly and includes various attributes such as the date of the accident, location, vehicle type, and demographic information about the individuals involved in the accidents.

Code
# Set the directory for the script

import sys
sys.path.append("../scripts") 

# Importing the required libraries

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

# Using Pandas to read the csv file and store it in a dataframe

df = pd.read_csv("../data/Crash_Data.csv", low_memory=False)

1.2 Exploring the Data and Checking the Data Types

Briefly inspect to check column names, types, summary statistics, and overall structure.

Code
# View the first and last five rows of the dataset

display(df.head())
display(df.tail())
Crash ID State Month Year Dayweek Time Crash Type Bus Involvement Articulated Truck Involvement Speed Limit Road User Gender Age Christmas Period Easter Period Month Name Age Group Time of day Day of week
0 120191210344 NSW 8 2019 Friday 19:30:00 Multiple No No 80.0 Passenger Male 88.0 No No August 75_or_older Night Weekend
1 3346487597363538801 WA 11 2022 Tuesday 20:25:00 Single No No 70.0 Driver Male 53.0 No No November 40_to_64 Night Weekday
2 3199201070009 QLD 1 1992 Tuesday 14:00:00 Single No No 100.0 Driver Female 20.0 No No January 17_to_25 Day Weekday
3 3935365546040620478 VIC 10 2024 Thursday 13:54:00 Multiple No No 100.0 Driver Male 34.0 No No October 26_to_39 Day Weekday
4 2200905020093 VIC 5 2009 Saturday 01:19:00 Single No No 80.0 Driver Male 33.0 No No May 26_to_39 Night Weekend
Crash ID State Month Year Dayweek Time Crash Type Bus Involvement Articulated Truck Involvement Speed Limit Road User Gender Age Christmas Period Easter Period Month Name Age Group Time of day Day of week
58171 2199509170258 VIC 9 1995 Sunday 10:25:00 Single No No 80.0 Motorcycle rider Male 24.0 No No September 17_to_25 Day Weekend
58172 5200910120120 WA 10 2009 Monday 14:30:00 Single No No 110.0 Passenger Female 8.0 No No October 0_to_16 Day Weekday
58173 5201502220022 WA 2 2015 Sunday 07:30:00 Single No No 110.0 Passenger Male 30.0 No No February 26_to_39 Day Weekend
58174 1198901130024 NSW 1 1989 Friday 17:35:00 Multiple No Yes 60.0 Driver Female 62.0 No No January 40_to_64 Day Weekday
58175 3199306170130 QLD 6 1993 Thursday 09:00:00 Multiple No No 60.0 Passenger Male 9.0 No No June 0_to_16 Day Weekday
Code
# Get information about the dataset
display(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58176 entries, 0 to 58175
Data columns (total 19 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Crash ID                       58176 non-null  int64  
 1   State                          58176 non-null  object 
 2   Month                          58176 non-null  int64  
 3   Year                           58176 non-null  int64  
 4   Dayweek                        58176 non-null  object 
 5   Time                           58176 non-null  object 
 6   Crash Type                     58167 non-null  object 
 7   Bus Involvement                58122 non-null  object 
 8   Articulated Truck Involvement  58160 non-null  object 
 9   Speed Limit                    56695 non-null  float64
 10  Road User                      58021 non-null  object 
 11  Gender                         58144 non-null  object 
 12  Age                            58068 non-null  float64
 13  Christmas Period               58176 non-null  object 
 14  Easter Period                  58176 non-null  object 
 15  Month Name                     58176 non-null  object 
 16  Age Group                      58068 non-null  object 
 17  Time of day                    58135 non-null  object 
 18  Day of week                    58135 non-null  object 
dtypes: float64(2), int64(3), object(14)
memory usage: 8.4+ MB
None
Code
# Take the number of rows and columns in the dataset, and the number of unique Crash ID's, and print them out in a sentence. 

print(f"Dataset contains {df.shape[0]:,} rows and {df.shape[1]} columns, "f"with {df['Crash ID'].nunique():,} unique crash IDs.")
Dataset contains 58,176 rows and 19 columns, with 52,500 unique crash IDs.

The table below provides a high-level summary of each column, including: - The number of unique values - The most frequently occurring value (where applicable) - The count of that most common value - The number of missing entries

Note: For identifier fields like Crash ID, the “top value” has limited interpretive value, but is more informative in fields such as State or Age.

Code
# Creating and populating a summary dataframe using pandas
summary_df = pd.DataFrame({
    'Column Name': df.columns, # Getting the column names
    'Unique Values': [df[col].nunique() for col in df.columns], # Counting the number of unique values in each column
    'Most Common Value': [df[col].mode()[0] for col in df.columns], # Finding the most common value in each column using mode function
    'Count of Most Common Value': [df[col].value_counts().iloc[0] for col in df.columns], # Counting the number of times the most common value appears in each column
    'Missing Values': df.isnull().sum().values # Counting the number of blank values in each column
})

# Displaying the summary dataframe
display(summary_df)
Column Name Unique Values Most Common Value Count of Most Common Value Missing Values
0 Crash ID 52500 1198912220772 35 0
1 State 8 NSW 17668 0
2 Month 12 12 5277 0
3 Year 37 1989 2800 0
4 Dayweek 7 Saturday 10529 0
5 Time 1425 15:00:00 1247 0
6 Crash Type 2 Single 32157 9
7 Bus Involvement 2 No 57059 54
8 Articulated Truck Involvement 2 No 52378 16
9 Speed Limit 16 100.0 19862 1481
10 Road User 6 Driver 26234 155
11 Gender 2 Male 41826 32
12 Age 102 18.0 2062 108
13 Christmas Period 2 No 56358 0
14 Easter Period 2 No 57820 0
15 Month Name 12 December 5277 0
16 Age Group 6 40_to_64 15047 108
17 Time of day 2 Day 33355 41
18 Day of week 2 Weekday 34349 41

47,567 unique crashes resulted in 52,843 fatalities — some events claimed multiple lives. The deadliest single crash on record involved 35 deaths (the 1989 Kempsey bus disaster).

Code
# Calculate statistics for the 'Age' column

mean_age = df['Age'].mean()
median_age = df['Age'].median()
min_age = df['Age'].min()
max_age = df['Age'].max()

print(f"The average age of individuals involved in fatal road crashes was {mean_age:.1f} years "f"(median: {median_age}, range: {min_age}{max_age}).")
The average age of individuals involved in fatal road crashes was 40.3 years (median: 35.0, range: 0.0–101.0).

Part 2: Data Preparation

2.1 Checking for null or missing values

From information provided in the ARDD data dictionary we know that missing values are represented by ‘-9’, ‘Unspecified’ or ‘Other/-9’. We will search for these values and replace them with null values (NaN).

Code
# Firstly, we will find the number of blank values in the dataset
blank_values = df.isnull().sum()

# Secondly, we will find the number of values set to '-9' which according to the data book are also missing data. 
neg_nine_values = df.isin([-9, 'Other/-9']).sum()

# Thirdly there are a handful of 'Unspecified' value in the dataset, so we will also sum them
unspecified_values = (df == "Unspecified").sum()

# We will now sum all the missing values
total_missing = blank_values + neg_nine_values + unspecified_values

# Displaying counts of blank, '-9', 'unspecified', and total missing values per column

missing_data_summary = pd.DataFrame({
    'Blank Values': blank_values,
    '-9 Values': neg_nine_values,
    'Unspecified Values': unspecified_values,
    'Total of Missing Values': total_missing
})

display(missing_data_summary)
Blank Values -9 Values Unspecified Values Total of Missing Values
Crash ID 0 0 0 0
State 0 0 0 0
Month 0 0 0 0
Year 0 0 0 0
Dayweek 0 0 0 0
Time 0 0 0 0
Crash Type 9 0 0 9
Bus Involvement 54 0 0 54
Articulated Truck Involvement 16 0 0 16
Speed Limit 1481 0 0 1481
Road User 155 0 0 155
Gender 32 0 0 32
Age 108 0 0 108
Christmas Period 0 0 0 0
Easter Period 0 0 0 0
Month Name 0 0 0 0
Age Group 108 0 0 108
Time of day 41 0 0 41
Day of week 41 0 0 41

Null or missing values will be replaced with NaN for easier analysis and columns with high proportions of missing data will be dropped from further analysis to ensure data quality and avoid skewed interpretations.

Heavy Rigid Truck Involvement, National Remoteness Areas, SA4 Name 2016, National LGA Name 2017, and National Road Type will be dropped.

2.2 Data Cleaning

Data is cleaned using a script that: - Replaces missing values with NaN. - Map numeric month values to month names and create a new column for month names. - Drops columns with high proportions of missing data. - Drops entries from the current year as that data is not yet complete.

Code
from data_cleaning import full_clean_pipeline

df = full_clean_pipeline(df)

# Create a dynamic year variable to get the latest year in the dataset to use in the analysis
latest_year = df['Year'].max()
earliest_year = df['Year'].min()
print(f"The dataset contains data from {earliest_year} to {latest_year}.")
The dataset contains data from 1989 to 2025.

2.3 Visualise the Remaining Missing Data

A quick visualisation of the remaining missing data will be created to help identify any remaining issues.

Code
# Creating a bar chart of the missing data
sns.set_style('whitegrid')
missing_value_count = df.isnull().sum() # We could instead call on the missing_data_summary dataframe using the 'total_missing' column, but defining a new variable is easier to read and shorter to type
plt.figure(figsize=(16,8))
sns.barplot(x=missing_value_count.index, 
            y=missing_value_count)
plt.title('Count of Missing Values per Column')
plt.xlabel('Column Name')
plt.ylabel('Count of missing values')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

The count chart is hard to read — a couple of columns dominate. The percentage view below (capped at 10%) makes the smaller gaps easier to see.

Code
# Creating a bar chart of the percentage of missing data
sns.set_style('whitegrid')
missing_values_percentage = (missing_value_count / len(df)) * 100
plt.figure(figsize=(16,8))
sns.barplot(x=missing_values_percentage.index, 
            y=missing_values_percentage)
plt.title('Percentage of Missing Values per Column')
plt.xlabel('Column Name')
plt.ylabel('Percentage of missing values (0 to 10%)')
plt.xticks(rotation=90)
plt.yticks(range(0, 11, 1))
plt.tight_layout()
plt.show()

Speed limit has the highest share of missing values at around 2.6%. Everything else is under 0.5% — low enough to leave in place and handle on a chart-by-chart basis.

2.4 Check for Duplicate Entries

Code
# Checking for duplicate rows
duplicate_rows = df[df.duplicated()]
display(duplicate_rows)

# Checking for duplicate values in the Crash ID column
duplicate_crash_ids = df[df.duplicated(['Crash ID'])]
display(duplicate_crash_ids)
Crash ID State Month Year Dayweek Time Crash Type Bus Involvement Articulated Truck Involvement Speed Limit Road User Gender Age Christmas Period Easter Period Month Name Age Group Time of day Day of week
4760 3199410240286 QLD 10 1994 Monday 10:00:00 Single Yes No 100.0 Passenger Female 72.0 No No October 65_to_74 Day Weekday
5234 1199107070301 NSW 7 1991 Sunday 23:30:00 Single No No 60.0 Passenger Male 18.0 No No July 17_to_25 Night Weekend
5439 6198901110004 TAS 1 1989 Wednesday 20:20:00 Multiple No Yes 100.0 Passenger Male 13.0 No No January 0_to_16 Night Weekday
6650 320201179894 QLD 6 2020 Sunday 04:00:00 Single No No 70.0 Passenger Female 14.0 No No June 0_to_16 Night Weekend
6797 2201111120221 VIC 11 2011 Saturday 12:37:00 Multiple No No 100.0 Passenger Female 20.0 No No November 17_to_25 Day Weekend
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
57433 3200011160243 QLD 11 2000 Thursday 15:00:00 Multiple No No 100.0 Passenger Female 68.0 No No November 65_to_74 Day Weekday
57502 3199903110042 QLD 3 1999 Thursday 21:00:00 Multiple No Yes 60.0 Passenger Female 17.0 No No March 17_to_25 Night Weekday
57746 4201501220007 SA 1 2015 Thursday 16:44:00 Multiple No Yes 110.0 Passenger Male 33.0 No No January 26_to_39 Day Weekday
57805 4201001120005 SA 1 2010 Tuesday 13:20:00 Multiple No No 100.0 Passenger Female 10.0 No No January 0_to_16 Day Weekday
58124 72017150290 NT 2 2017 Saturday 02:26:00 Multiple No No 100.0 Pedestrian Male 15.0 No No February 0_to_16 Night Weekend

166 rows × 19 columns

Crash ID State Month Year Dayweek Time Crash Type Bus Involvement Articulated Truck Involvement Speed Limit Road User Gender Age Christmas Period Easter Period Month Name Age Group Time of day Day of week
828 1200008150335 NSW 8 2000 Tuesday 14:30:00 Multiple No Yes 100.0 Passenger Female 24.0 No No August 17_to_25 Day Weekday
880 1198911140670 NSW 11 1989 Tuesday 16:10:00 Single No No 60.0 Passenger Female 34.0 No No November 26_to_39 Day Weekday
1053 1199803100082 NSW 3 1998 Tuesday 15:30:00 Multiple No No 100.0 Driver Male 52.0 No No March 40_to_64 Day Weekday
1101 3199307300180 QLD 7 1993 Friday 00:00:00 Single No No 60.0 Passenger Male 44.0 No No July 40_to_64 Night Weekday
1207 4201813570 WA 1 2018 Wednesday 12:14:00 Multiple No No 110.0 Driver Female 49.0 No No January 40_to_64 Day Weekday
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
58139 1198910110598 NSW 10 1989 Wednesday 21:15:00 Single No No 100.0 Driver Female 17.0 No No October 17_to_25 Night Weekday
58145 4201913952 WA 5 2019 Thursday 00:29:00 Single No No NaN Pedestrian Female 23.0 No No May 17_to_25 Night Weekday
58161 3201105130075 QLD 5 2011 Friday 23:00:00 Single No No 60.0 Driver Male 27.0 No No May 26_to_39 Night Weekend
58172 5200910120120 WA 10 2009 Monday 14:30:00 Single No No 110.0 Passenger Female 8.0 No No October 0_to_16 Day Weekday
58175 3199306170130 QLD 6 1993 Thursday 09:00:00 Multiple No No 60.0 Passenger Male 9.0 No No June 0_to_16 Day Weekday

5676 rows × 19 columns

In this dataset, Crash IDs represent a single event, but there are multiple rows for each Crash ID in some cases when there have been multiple fatalities in a crash. Since there appears to be no true duplicate rows, we will not drop any rows from the dataset.

2.5 Checking for Outliers

In this dataset we have a single continuous variable of interest, age. We will use a boxplot to check for outliers in this column.

Code
# Creating a boxplot to check for outliers in the Age column
plt.figure(figsize=(8,3))
sns.boxplot(x=df['Age'])
plt.title('Boxplot of Age')
plt.xlabel('Age')
plt.show()

No outliers visible. The IQR check below confirms this.

Code
# Creating a function to calculate the IRQ
def IRQ_and_bounds(column):
    Q1 = column.quantile(0.25)
    Q3 = column.quantile(0.75)
    IRQ = Q3 - Q1
    lower_bound = Q1 - (1.5 * IRQ)
    upper_bound = Q3 + (1.5 * IRQ)
    return IRQ, lower_bound, upper_bound

# Using the IRQ function to calculate the IRQ of the Age column
IRQ_and_bounds(df['Age'])

# Using the IRQ function to check for outliers in the Age column

irq_value, lower_bound, upper_bound = IRQ_and_bounds(df['Age'])

print(f"IRQ: {irq_value}, Lower Bound: {lower_bound}, Upper Bound: {upper_bound}")

# Using the IRQ function to check for outliers in the Age column
def identify_outliers(column):
    _, lower_bound, upper_bound = IRQ_and_bounds(column)
    outliers = column[(column < lower_bound) | (column > upper_bound)]
    return outliers

outliers = identify_outliers(df['Age'])
print(f"Outliers: {outliers.values}")
IRQ: 34.0, Lower Bound: -29.0, Upper Bound: 107.0
Outliers: []

Part 3: Data Visualisation

This section explores patterns in fatal road transport accidents in Australia from 1989 to 2025. Each row in the dataset represents a single fatality and includes demographic, geographic, and temporal details.

The visualisations aim to examine:

  • The demographic characteristics of individuals involved in fatal crashes

  • The geographic distribution of fatal incidents across Australian states and territories

  • Temporal trends, including changes over time and differences by day of week or time of year

  • Shifts in the demographic profile of fatalities over time

  • Trends in pedestrian and cyclist fatalities, and whether these have changed meaningfully since 1989

These visual insights provide context for evaluating the impact of road safety initiatives and identifying groups most at risk.

3.1 Demographic Analysis of Fatal Road Accidents in Australia

Code
# Age and Sex distribution of fatal road transport accidents in Australia
sns.set_style('whitegrid')
plt.figure(figsize=(16,8))
sns.histplot(df['Age'], bins=40,
            kde= False,
            alpha=0.9)
plt.title(f"Age Distribution of Fatal Road Transport Accidents in Australia ({earliest_year} to {latest_year})")
plt.xlabel('Age')
plt.xticks(range(0, 101, 10))
plt.ylabel('Count')
plt.show()

Fatalities peak sharply among people in their early 20s, then taper off steadily with age.

Code
# Plot of fatal accidents by gender

sns.set_style('whitegrid')
plt.figure(figsize=(4,6))
sns.countplot(x='Gender', 
            data=df,
            alpha=0.9)
plt.title(f"Gender Distribution of Fatal Accidents in Australia by\nCount ({earliest_year} to {latest_year})")
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()

Males make up the clear majority — the chart below expresses this as a percentage.

Code
# Gender distribution of fatal accidents by gender as a percentage
gender_percentage = (df['Gender'].value_counts(normalize=True) * 100).reset_index()
gender_percentage.columns = ['Gender', 'Percentage']

# Plotting
plt.figure(figsize=(4,6))
sns.set_style('whitegrid')
sns.barplot(x='Gender', 
            y='Percentage', 
            data=gender_percentage, 
            alpha=0.9)
plt.title(f"Gender Distribution of Fatal Accidents in Australia\nas a Percentage ({earliest_year} to {latest_year})")
plt.xlabel('Gender')
plt.ylabel('Percentage')
plt.show()

Around 70% of road fatalities are male — a ratio that, as we’ll see later, has stayed remarkably consistent over time.

Code
# Density plot of age and gender
plt.figure(figsize=(16,8))
sns.set_style('whitegrid')
sns.kdeplot(data=df, 
            x='Age', 
            hue='Gender', 
            fill=True, 
            palette='rainbow', 
            alpha=0.5)
plt.title(f"Age and Gender Distribution of Fatal Accidents in Australia ({earliest_year} to {latest_year})")
plt.xlabel('Age')
plt.ylabel('Density')
plt.show()

Both genders peak around age 20 and follow a broadly similar distribution. The difference is largely one of scale, not shape.

Code
# Box plot of age and gender distribution
plt.figure(figsize=(4,6))
sns.set_style('whitegrid')
sns.boxplot(x='Gender', 
            y='Age', 
            data=df)
plt.title(f"Box Plot of Age and Gender Distribution of Fatal Accidents\nin Australia ({earliest_year} to {latest_year})")
plt.xlabel('Gender')
plt.ylabel('Age')
plt.show()

Male fatalities skew slightly older at the median, with a handful of outliers at the upper end.

Code
# Calculate the number of accidents by road user
road_user_counts = df['Road User'].value_counts().reset_index()
road_user_counts.columns = ['Road User', 'Accidents']

# Fatalities by Road User Type
sns.set_style('whitegrid')
plt.figure(figsize=(12,6))
sns.barplot(x='Accidents', 
            y='Road User', 
            data=road_user_counts, 
            hue='Road User', 
            palette='rainbow')
plt.title(f"Number of Fatal Accidents between ({earliest_year} to {latest_year}) by Road User Type")
plt.xlabel('Number of Accidents')
plt.ylabel('Road User Type')
plt.tight_layout()
plt.show()

Drivers and passengers make up the bulk of fatalities — unsurprisingly, given how many more cars there are on the road than anything else. Pedestrians and motorcyclists follow, and both are examined in more detail in the sub-analyses.

Code
# Calculate the number of accidents by road user
age_group_fatalities = df.groupby(['Year', 'Age Group'], observed=True)['Crash ID'].size().reset_index(name='Fatalities')

# Using a pivot table to transform the data into wide format
age_group_fatalities_pivot = age_group_fatalities.pivot(index='Year', 
                                                        columns='Age Group', 
                                                        values='Fatalities')
age_group_fatalities_pivot.columns = age_group_fatalities_pivot.columns.str.replace('_', ' ') # Remove underscores from column names so they appear more nearly in the legend

# Plotting
sns.set_style('whitegrid')
age_group_fatalities_pivot.plot(kind='bar', 
                                stacked=True, 
                                figsize=(12, 6), 
                                cmap='rainbow')
plt.title(f"Fatalities by Age Group Since {earliest_year} to {latest_year}")
plt.xticks(rotation=45)
plt.xlabel('Year')
plt.ylabel('Fatalities')
plt.legend(title='Age Group')
plt.show()

Fatalities have fallen across all age groups since 1989, with the proportional mix between groups staying broadly consistent.

Code
# Calculate the number of accidents by road user 
road_user_fatalities = df.groupby(['Year', 'Road User'])['Crash ID'].size().reset_index(name='Fatalities')

# Using a pivot table to transform the data into wide format
road_user_fatalities_pivot = road_user_fatalities.pivot(index='Year', 
                                                        columns='Road User', 
                                                        values='Fatalities')

# Plotting
sns.set_style('whitegrid')
road_user_fatalities_pivot.plot(kind='bar', 
                                stacked=True, 
                                figsize=(12, 6), 
                                cmap='jet')
plt.title(f"Fatalities by Road User Type ({earliest_year} to {latest_year})")
plt.xlabel('Year')
plt.ylabel('Fatalities')
plt.xticks(rotation=45)
plt.legend(title='Road User Type')
plt.show()

The same broad decline holds across all road user types, with no major shifts in the relative mix over time.

3.2 Geographic Analysis of Fatal Road Accidents in Australia

Code
# Pie chart of fatal accidents by State
# Firstly, calculate the number of accidents by state. 
state_counts = df['State'].value_counts().reset_index()
state_counts.columns = ['State', 'Accidents']

plt.figure(figsize=(10,10))
plt.pie(x=state_counts['Accidents'], 
        labels=state_counts['State'], 
        autopct='%1.1f%%', 
        startangle=90)
plt.title(f"Percentage of Fatal Accidents by State ({earliest_year} to {latest_year})")
plt.show()

NSW, Victoria, and Queensland together account for around 80% of fatalities — broadly in line with their share of the national population.

Code
# Bar chart of fatal accidents by State
sns.set_style('whitegrid')
plt.figure(figsize=(12,6))
sns.barplot(x='Accidents', 
            y='State', 
            data=state_counts,
            hue='State',
            alpha=0.9,
            palette='rainbow')
plt.title(f"Number of Fatal Accidents by State ({earliest_year} to {latest_year})")
plt.xlabel('Number of Accidents')
plt.ylabel('State')
plt.show()

Same data, different format — easier to compare state counts directly.

3.3 Temporal Analysis of Fatal Road Accidents in Australia

Accidents by Month of Year

Code
# Now we will calculate the number of accidents by month

month_counts = df['Month Name'].value_counts().reset_index()
month_counts.columns = ['Month', 'Accidents']

# Sorting the months in order by converting them to a categorical variable and sorting by the month order

month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
            'August', 'September', 'October', 'November', 'December']
month_counts['Month'] = pd.Categorical(month_counts['Month'], categories=month_order, ordered=True)
month_counts = month_counts.sort_values('Month')
Code
# Plot the number of accidents by month

sns.set_style('whitegrid')
plt.figure(figsize=(12,6))
sns.barplot(x='Month', 
            y='Accidents', 
            data=month_counts, 
            hue='Month', 
            palette='rainbow',
            alpha=0.9)
plt.title(f"Number of Fatal Accidents by Month ({earliest_year} to {latest_year})")
plt.xlabel('Month')
plt.ylabel('Number of Accidents')
plt.xticks(rotation=45)
plt.yticks(range(0, 5500, 500))
plt.show()

Fatalities are fairly evenly distributed throughout the year, with a modest uptick in December and March — both associated with increased traffic volumes and holiday travel.

Code
# Count accidents by day of the week
day_counts = df['Dayweek'].value_counts().reset_index()
day_counts.columns = ['Day', 'Accidents']

# Sort the days in order
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_counts['Day'] = pd.Categorical(day_counts['Day'], categories=day_order, ordered=True)
day_counts = day_counts.sort_values('Day')
Code
# Plot the number of accidents by day of the week

sns.set_style('whitegrid')
plt.figure(figsize=(12,6))
sns.barplot(x='Day', 
            y='Accidents', 
            data=day_counts, 
            hue='Day', 
            palette='rainbow',
            alpha=0.9)
plt.title(f"Number of Fatal Accidents by Day of the Week ({earliest_year} to {latest_year})")
plt.xlabel('Day of the Week')
plt.yticks(range(0, 12000, 1000))
plt.ylabel('Number of Accidents')
plt.xticks(rotation=45)
plt.show()

Saturday records the most fatalities, followed by Friday and Sunday. The weekend effect is clear and consistent.

Code
day_counts['Percentage'] = (day_counts['Accidents'] / day_counts['Accidents'].sum()) * 100
sns.set_style('whitegrid')
plt.figure(figsize=(12,6))
sns.barplot(x='Day', 
            y='Percentage', 
            data=day_counts, 
            hue='Day', 
            palette='rainbow')
plt.title(f"Percentage of Fatal Accidents by Day of the Week ({earliest_year} to {latest_year})")
plt.xlabel('Day of the Week')
plt.ylabel('Percentage of Accidents (%)')
plt.xticks(rotation=45)
plt.show()

A day-of-week and hour-of-day heatmap makes the pattern much clearer.

Code
# We will create a new data frame for this visualisation because we have to do some manipulation that involves dropping rows. 

# Creating a new dataframe
heatmap_df = df[['Time', 'Dayweek']].copy()

# Dropping rows with missing values
heatmap_df = heatmap_df.dropna(subset=['Time'])

# Then we will extract the hour from the time field to make it easier to create a heatmap
heatmap_df['Hour'] = heatmap_df['Time'].str.split(':').str[0].astype(int)

# Next we need to create a pivot table to convert the data into wide format
pivot_table = pd.pivot_table(heatmap_df, 
                            values='Time', 
                            index=['Hour'], 
                            columns=['Dayweek'], 
                            aggfunc='count', 
                            fill_value=0)

# Finally plotting a heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(pivot_table[day_order], 
            annot=True, 
            cmap='RdYlGn_r', # Take a colour pallet from https://loading.io/color/feature/RdYlGn-9/ and use _r to flip it so that red is higher and green is lower
            fmt='g')
plt.title(f"Heatmap of Fatal Accidents ({earliest_year} to {latest_year}) by Day of Week and Hour of Day")
plt.xlabel('Day of the Week')
plt.ylabel('Hour of Day')
plt.show()

Friday and Saturday afternoons stand out clearly — the 3pm Friday slot in particular. Late-night weekend hours also see elevated numbers.

Change over time in the number of fatal accidents by year

Code
# Creating a line graph of fatalities by year
fatalities_per_year = df.groupby('Year')['Crash ID'].size().reset_index()
fatalities_per_year.columns = ['Year', 'Fatalities']

# Creating a Line Plot
sns.set_style('whitegrid')
plt.figure(figsize=(12, 6))
sns.lineplot(x='Year', 
            y='Fatalities',
            data=fatalities_per_year,
            marker="o")
plt.title(f"Total Number of Fatalities by Year ({earliest_year} to {latest_year})")
plt.xlabel('Year')
plt.ylabel('Number of Fatalities')
plt.grid(True)
plt.show()

Road fatalities have declined substantially since 1989, but have risen from their historic low of around 1,100 in 2020–21 to levels not seen since 2016. Since Australia’s population has grown considerably over this period, raw counts alone can be misleading — a population-adjusted measure gives a more accurate picture of road safety progress.

Code
# We find population data for Australia from 1989 to 2021 from the United Nations website: https://population.un.org/dataportal/data/indicators/49/locations/36/start/1989/end/2021/table/pivotbylocation
# Using this data we create a dictionary with Year as the index. 

australian_population_data =  {
    'Year': list(range(1989, 2026)),
    'Population': [
        16796588, 17048003, 17271086, 17462504, 17631511, 17805504, 18003000,
        18211845, 18410250, 18601667, 18800892, 19017963, 19248143, 19475844,
        19698999, 19925056, 20171731, 20467030, 20830828, 21247873, 21660892,
        22019168, 22357034, 22729269, 23111782, 23469579, 23820236, 24195701,
        24590334, 24979230, 25357170, 25670051, 25921089, 26200985, 26451124,
        26713205, 26974026
    ]
}

# Creating a dataframe from the dictionary
population_df = pd.DataFrame(australian_population_data)

# Merge the population data with the fatalities data
merged_df = pd.merge(fatalities_per_year, population_df, on='Year', how='inner')

# Calculate the number of fatalities per 100,000 people
merged_df['Fatalities per 100k'] = (merged_df['Fatalities'] / merged_df['Population']) * 100000

# Line plot of fatalities per 100,000 people
sns.set_style('whitegrid')
plt.figure(figsize=(12, 6))
sns.lineplot(x='Year', 
            y='Fatalities per 100k', 
            data=merged_df,
            marker="o")
plt.title(f"Fatalities per 100,000 Population by Year ({earliest_year} to {latest_year})")
plt.xlabel('Year')
plt.ylabel('Fatalities per 100,000 Population')
plt.grid(True)
plt.show()

When adjusted for population growth, the picture is more nuanced. Fatalities per 100,000 people remain only slightly above the pre-pandemic low, reflecting the fact that the recent rise in raw numbers is partly explained by a larger population base. Population-adjusted figures are generally a more meaningful measure of road safety progress, and on that basis the recent increase, while worth monitoring, is considerably less dramatic than the raw numbers suggest.

Fatalities by gender over time

Code
# Grouping the number of fatalities each year by gender
gender_fatalities = df.groupby(['Year', 'Gender'])['Crash ID'].size().reset_index(name='Fatalities')

# Calculating total fatalities per year
total_fatalities_per_year = gender_fatalities.groupby('Year')['Fatalities'].sum().reset_index(name='Total Fatalities')

# Merging the dataframes and calculating proportions
gender_fatalities = pd.merge(gender_fatalities, total_fatalities_per_year, on='Year')
gender_fatalities['Proportion'] = (gender_fatalities['Fatalities'] / gender_fatalities['Total Fatalities']) * 100

# Pivoting the data for easier plotting
gender_proportions_pivot = gender_fatalities.pivot(index='Year', columns='Gender', values='Proportion')

# Plotting
sns.set_style('whitegrid')
plt.figure(figsize=(12, 6))
sns.lineplot(data=gender_proportions_pivot)
plt.title(f"Proportion of Fatalities by Gender Over Time ({earliest_year} to {latest_year})")
plt.xlabel('Year')
plt.ylabel('Proportion of Fatalities (%)')
plt.grid(color='grey', linestyle='--', linewidth=0.5, alpha=0.5)
plt.legend(title='Gender')
plt.show()

The male-to-female ratio has barely shifted since 1989 — this is a persistent structural pattern, not a recent phenomenon.

Fatalities by age group over time

Code
# Plotting
sns.set_style('whitegrid')
plt.figure(figsize=(12, 6))
sns.lineplot(data=age_group_fatalities_pivot, marker="o", linewidth=2.5) # Use the pivot table created earlier
plt.title(f"Fatalities Over Time by Age Group ({earliest_year} to {latest_year})")
plt.xlabel('Year')
plt.ylabel('Number of Fatalities')
plt.grid(color='gray', linestyle='--', linewidth=0.5)
plt.legend(title='Age Group')
plt.show()

Every age group has seen a decline, but the 17–25 cohort stands out — the drop is steeper than any other, likely reflecting graduated licensing reforms introduced across Australian states in the 1990s and 2000s. Around the same time, 40–64 year olds overtook younger drivers as the age group with the highest annual fatality count.

Fatalities by road user type over time

Code
# Grouping data by year and road user type, and counting fatalities
ped_and_cyclist_fatalities = df[df['Road User'].isin(['Pedal cyclist', 'Pedestrian'])]

# Group data by Year and Road User, counting all Crash ID occurrences
fatalities_by_year_user = ped_and_cyclist_fatalities.groupby(['Year', 'Road User'])['Crash ID'].count().reset_index(name='Fatalities')

# Merge fatalities data with population data
fatalities_with_pop = pd.merge(fatalities_by_year_user, population_df, on='Year', how='left')

# Calculate fatalities per 100,000 people
fatalities_with_pop['Fatalities per 100k'] = (fatalities_with_pop['Fatalities'] / fatalities_with_pop['Population']) * 100000

# Pivot the data
sns.set_style('whitegrid')
plt.figure(figsize=(12, 6))
sns.lineplot(x='Year', 
            y='Fatalities per 100k',
            data=fatalities_with_pop, 
            hue='Road User', 
            marker="o")
plt.title(f"Fatalities per 100,000 Population Over Time by Road User Type ({earliest_year} to {latest_year})")
plt.xlabel('Year')
plt.ylabel('Fatalities per 100,000 Population')
plt.grid(True)
sns.despine()
plt.tight_layout()
plt.show()

Pedestrian fatalities have fallen steadily since 1989. Cyclist deaths have barely moved since the mid-1990s — a gap explored further in the vulnerable road users notebook.

Part 4: Conclusion

This analysis mapped how fatal road crashes in Australia have changed from 1989 to 2025 across people, places, and time.

Key findings include: - A significant overall decline in fatalities, both in absolute terms and per 100,000 population. - The largest reduction occurred among 17–25-year-old drivers, potentially linked to licensing reforms targeting provisional licence holders. - Males consistently account for approximately 70% of road fatalities throughout the observed period. - Fatal accidents are most concentrated between Friday and Sunday, with Friday afternoons posing the highest risk. - New South Wales, Victoria, and Queensland collectively represent over 80% of national road fatalities. - Pedestrian fatalities have decreased over time, while cyclist fatalities have remained relatively stable since the mid-1990s.

Opportunities for Further Research

This dataset covers fatal crashes only. Adding non-fatal injury data would give a fuller view of road harm, not just the most severe outcomes.
It would also be useful to examine injury-to-fatality ratios for pedestrians and cyclists to better understand where safety gains are happening and where they are stalling.