Analysis of Fatal Road Accidents in Australia since 1989

Introduction

This notebook looks at fatal road crashes in Australia from 1989 onward using the Australian Road Deaths Database (ARDD). The focus is simple: who is most affected, where risk is concentrated, and how patterns have shifted over time.

The key objectives of this analysis are to:

Identify the demographic groups most affected by fatal road accidents
Examine how fatalities are distributed across Australian states and territories
Explore patterns in accident timing (days of the week, months of the year)
Analyse trends over time in road fatalities overall and among vulnerable road users (pedestrians, cyclists)
Use straightforward data cleaning and exploratory visualisation to surface practical trends for policy and prevention work.

Key Findings

Road fatalities have declined substantially since 1989, both in raw numbers and when adjusted per 100,000 population, although counts have risen again since the 2020–21 low.
People aged 17–25 historically had the highest fatality counts but have seen the steepest decline, likely due to graduated licensing reforms.
Males account for approximately 70% of all road fatalities, a ratio that has remained consistent over time.
Fatalities are concentrated between Friday and Sunday, with Saturday the worst single day and the Friday 3pm slot standing out in the hourly view.
New South Wales, Victoria, and Queensland together account for around 80% of all fatalities.
Pedestrian fatalities have decreased over time, but cyclist fatalities have remained relatively constant since the mid-1990s.

Part 1: Data Checking

1.1 Reading Data and Importing Libraries

Import the necessary libraries and read the dataset into a pandas DataFrame. The dataset is in CSV format and contains information about fatal road accidents in Australia from 1989. The dataset is updated monthly and includes various attributes such as the date of the accident, location, vehicle type, and demographic information about the individuals involved in the accidents.

Code

# Add the scripts directory to the import path so data_cleaning can be imported later

import sys
sys.path.append("../scripts")

# Importing the required libraries

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

# Using Pandas to read the csv file and store it in a dataframe

df = pd.read_csv("../data/Crash_Data.csv", low_memory=False)

1.2 Exploring the Data and Checking the Data Types

Briefly inspect the dataset to check column names, data types, summary statistics, and overall structure.

Code

# View the first and last five rows of the dataset

display(df.head())
display(df.tail())

	Crash ID	State	Month	Year	Dayweek	Time	Crash Type	Bus Involvement	Articulated Truck Involvement	Speed Limit	Road User	Gender	Age	Christmas Period	Easter Period	Month Name	Age Group	Time of day	Day of week
0	120191210344	NSW	8	2019	Friday	19:30:00	Multiple	No	No	80.0	Passenger	Male	88.0	No	No	August	75_or_older	Night	Weekend
1	3346487597363538801	WA	11	2022	Tuesday	20:25:00	Single	No	No	70.0	Driver	Male	53.0	No	No	November	40_to_64	Night	Weekday
2	3199201070009	QLD	1	1992	Tuesday	14:00:00	Single	No	No	100.0	Driver	Female	20.0	No	No	January	17_to_25	Day	Weekday
3	3935365546040620478	VIC	10	2024	Thursday	13:54:00	Multiple	No	No	100.0	Driver	Male	34.0	No	No	October	26_to_39	Day	Weekday
4	2200905020093	VIC	5	2009	Saturday	01:19:00	Single	No	No	80.0	Driver	Male	33.0	No	No	May	26_to_39	Night	Weekend

	Crash ID	State	Month	Year	Dayweek	Time	Crash Type	Bus Involvement	Articulated Truck Involvement	Speed Limit	Road User	Gender	Age	Christmas Period	Easter Period	Month Name	Age Group	Time of day	Day of week
58171	2199509170258	VIC	9	1995	Sunday	10:25:00	Single	No	No	80.0	Motorcycle rider	Male	24.0	No	No	September	17_to_25	Day	Weekend
58172	5200910120120	WA	10	2009	Monday	14:30:00	Single	No	No	110.0	Passenger	Female	8.0	No	No	October	0_to_16	Day	Weekday
58173	5201502220022	WA	2	2015	Sunday	07:30:00	Single	No	No	110.0	Passenger	Male	30.0	No	No	February	26_to_39	Day	Weekend
58174	1198901130024	NSW	1	1989	Friday	17:35:00	Multiple	No	Yes	60.0	Driver	Female	62.0	No	No	January	40_to_64	Day	Weekday
58175	3199306170130	QLD	6	1993	Thursday	09:00:00	Multiple	No	No	60.0	Passenger	Male	9.0	No	No	June	0_to_16	Day	Weekday

Code

# Get information about the dataset (df.info() prints directly and returns None, so it is not wrapped in display())
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58176 entries, 0 to 58175
Data columns (total 19 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Crash ID                       58176 non-null  int64  
 1   State                          58176 non-null  object 
 2   Month                          58176 non-null  int64  
 3   Year                           58176 non-null  int64  
 4   Dayweek                        58176 non-null  object 
 5   Time                           58176 non-null  object 
 6   Crash Type                     58167 non-null  object 
 7   Bus Involvement                58122 non-null  object 
 8   Articulated Truck Involvement  58160 non-null  object 
 9   Speed Limit                    56695 non-null  float64
 10  Road User                      58021 non-null  object 
 11  Gender                         58144 non-null  object 
 12  Age                            58068 non-null  float64
 13  Christmas Period               58176 non-null  object 
 14  Easter Period                  58176 non-null  object 
 15  Month Name                     58176 non-null  object 
 16  Age Group                      58068 non-null  object 
 17  Time of day                    58135 non-null  object 
 18  Day of week                    58135 non-null  object 
dtypes: float64(2), int64(3), object(14)
memory usage: 8.4+ MB

Code

# Report the number of rows, columns, and unique crash IDs in one sentence

print(
    f"Dataset contains {df.shape[0]:,} rows and {df.shape[1]} columns, "
    f"with {df['Crash ID'].nunique():,} unique crash IDs."
)

Dataset contains 58,176 rows and 19 columns, with 52,500 unique crash IDs.

The table below provides a high-level summary of each column, including:

The number of unique values
The most frequently occurring value (where applicable)
The count of that most common value
The number of missing entries

Note: For identifier fields like Crash ID, the “top value” has limited interpretive value, but is more informative in fields such as State or Age.

Code

# Creating and populating a summary dataframe using pandas
summary_df = pd.DataFrame({
    'Column Name': df.columns, # Getting the column names
    'Unique Values': [df[col].nunique() for col in df.columns], # Counting the number of unique values in each column
    'Most Common Value': [df[col].mode()[0] for col in df.columns], # Finding the most common value in each column using mode function
    'Count of Most Common Value': [df[col].value_counts().iloc[0] for col in df.columns], # Counting the number of times the most common value appears in each column
    'Missing Values': df.isnull().sum().values # Counting the number of blank values in each column
})

# Displaying the summary dataframe
display(summary_df)

	Column Name	Unique Values	Most Common Value	Count of Most Common Value	Missing Values
0	Crash ID	52500	1198912220772	35	0
1	State	8	NSW	17668	0
2	Month	12	12	5277	0
3	Year	37	1989	2800	0
4	Dayweek	7	Saturday	10529	0
5	Time	1425	15:00:00	1247	0
6	Crash Type	2	Single	32157	9
7	Bus Involvement	2	No	57059	54
8	Articulated Truck Involvement	2	No	52378	16
9	Speed Limit	16	100.0	19862	1481
10	Road User	6	Driver	26234	155
11	Gender	2	Male	41826	32
12	Age	102	18.0	2062	108
13	Christmas Period	2	No	56358	0
14	Easter Period	2	No	57820	0
15	Month Name	12	December	5277	0
16	Age Group	6	40_to_64	15047	108
17	Time of day	2	Day	33355	41
18	Day of week	2	Weekday	34349	41

The 58,176 rows (one per fatality) cover 52,500 unique crashes, so some events claimed multiple lives. The deadliest single crash on record involved 35 deaths (the 1989 Kempsey bus disaster).
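Because each row is one fatality, the number of deaths per crash can be recovered by counting rows per Crash ID. The following is a minimal sketch on a made-up frame; the IDs are illustrative and not taken from the ARDD.

```python
import pandas as pd

# Toy stand-in for the crash data: one row per fatality (IDs are made up)
toy = pd.DataFrame({"Crash ID": [101, 101, 101, 202, 303, 303]})

# Count rows per crash to get the number of fatalities in each event
deaths_per_crash = toy.groupby("Crash ID").size()

n_crashes = deaths_per_crash.shape[0]                  # unique crashes
n_fatalities = int(deaths_per_crash.sum())             # total fatality rows
n_multi_fatality = int((deaths_per_crash > 1).sum())   # crashes with 2+ deaths
worst_crash_id = deaths_per_crash.idxmax()             # crash with the most deaths

print(f"{n_crashes} crashes, {n_fatalities} fatalities, "
      f"{n_multi_fatality} multi-fatality crashes, worst: {worst_crash_id}")
```

Applied to the real DataFrame, the same `groupby("Crash ID").size()` call would give the per-crash death toll directly.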

Code

# Calculate statistics for the 'Age' column

mean_age = df['Age'].mean()
median_age = df['Age'].median()
min_age = df['Age'].min()
max_age = df['Age'].max()

print(f"The average age of individuals involved in fatal road crashes was {mean_age:.1f} years "f"(median: {median_age}, range: {min_age}–{max_age}).")

The average age of individuals involved in fatal road crashes was 40.3 years (median: 35.0, range: 0.0–101.0).

Part 2: Data Preparation

2.1 Checking for null or missing values

From information provided in the ARDD data dictionary we know that missing values are represented by ‘-9’, ‘Unspecified’ or ‘Other/-9’. We will search for these values and replace them with null values (NaN).

Code

# First, count the blank (NaN) values in each column
blank_values = df.isnull().sum()

# Second, count the values set to -9 or 'Other/-9', which the data dictionary also treats as missing
neg_nine_values = df.isin([-9, 'Other/-9']).sum()

# Third, count the handful of 'Unspecified' values in the dataset
unspecified_values = (df == "Unspecified").sum()

# We will now sum all the missing values
total_missing = blank_values + neg_nine_values + unspecified_values

# Displaying counts of blank, '-9', 'unspecified', and total missing values per column

missing_data_summary = pd.DataFrame({
    'Blank Values': blank_values,
    '-9 Values': neg_nine_values,
    'Unspecified Values': unspecified_values,
    'Total of Missing Values': total_missing
})

display(missing_data_summary)

	Blank Values	Total of Missing Values
Crash ID	0	0
State	0	0
Month	0	0
Year	0	0
Dayweek	0	0
Time	0	0
Crash Type	9	9
Bus Involvement	54	54
Articulated Truck Involvement	16	16
Speed Limit	1481	1481
Road User	155	155
Gender	32	32
Age	108	108
Christmas Period	0	0
Easter Period	0	0
Month Name	0	0
Age Group	108	108
Time of day	41	41
Day of week	41	41

Null or missing values will be replaced with NaN for easier analysis and columns with high proportions of missing data will be dropped from further analysis to ensure data quality and avoid skewed interpretations.

Heavy Rigid Truck Involvement, National Remoteness Areas, SA4 Name 2016, National LGA Name 2017, and National Road Type will be dropped.

2.2 Data Cleaning

Data is cleaned using a script that:

Replaces missing values with NaN.
Maps numeric month values to month names and creates a new column for month names.
Drops columns with high proportions of missing data.
Drops entries from the current year, as that data is not yet complete.
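The contents of `data_cleaning.full_clean_pipeline` are not shown in this notebook. As a rough sketch only, the steps listed above could be written as below. The function name `sketch_clean`, the 50% threshold for dropping columns, and the `current_year` argument are all assumptions for illustration, not the script's actual behaviour.

```python
import calendar
import datetime as dt

import numpy as np
import pandas as pd

# Sentinel codes that the data dictionary uses for missing values
MISSING_CODES = [-9, "-9", "Unspecified", "Other/-9"]


def sketch_clean(raw, max_missing_share=0.5, current_year=None):
    """Hypothetical stand-in for full_clean_pipeline (not the real script)."""
    if current_year is None:
        current_year = dt.date.today().year

    # Replace the sentinel codes with NaN (returns a new frame)
    out = raw.replace(MISSING_CODES, np.nan)

    # Map month numbers (1-12) to month names in a new column
    out["Month Name"] = out["Month"].map(lambda m: calendar.month_name[int(m)])

    # Drop columns whose share of missing values exceeds the threshold
    missing_share = out.isnull().mean()
    too_sparse = missing_share[missing_share > max_missing_share].index
    out = out.drop(columns=too_sparse)

    # Drop the incomplete current year
    return out[out["Year"] < current_year].reset_index(drop=True)


# Made-up rows to exercise each step
toy = pd.DataFrame({
    "Year": [2023, 2024, 2025],
    "Month": [1, 12, 6],
    "Gender": ["Male", "Unspecified", "Female"],
    "Mostly Empty": [-9, -9, 5],
})
cleaned = sketch_clean(toy, current_year=2025)
print(cleaned)
```

Passing `current_year` explicitly keeps the sketch reproducible; the real script may decide which year to drop in a different way.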

Code

from data_cleaning import full_clean_pipeline

df = full_clean_pipeline(df)

# Create a dynamic year variable to get the latest year in the dataset to use in the analysis
latest_year = df['Year'].max()
earliest_year = df['Year'].min()
print(f"The dataset contains data from {earliest_year} to {latest_year}.")

The dataset contains data from 1989 to 2025.

2.3 Visualise the Remaining Missing Data

A quick visualisation of the remaining missing data will be created to help identify any remaining issues.

Code

# Creating a bar chart of the missing data
sns.set_style('whitegrid')
missing_value_count = df.isnull().sum() # We could instead call on the missing_data_summary dataframe using the 'total_missing' column, but defining a new variable is easier to read and shorter to type
plt.figure(figsize=(16,8))
sns.barplot(x=missing_value_count.index, 
            y=missing_value_count)
plt.title('Count of Missing Values per Column')
plt.xlabel('Column Name')
plt.ylabel('Count of missing values')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

The count chart is hard to read — a couple of columns dominate. The percentage view below (capped at 10%) makes the smaller gaps easier to see.

Code

# Creating a bar chart of the percentage of missing data
sns.set_style('whitegrid')
missing_values_percentage = (missing_value_count / len(df)) * 100
plt.figure(figsize=(16,8))
sns.barplot(x=missing_values_percentage.index, 
            y=missing_values_percentage)
plt.title('Percentage of Missing Values per Column')
plt.xlabel('Column Name')
plt.ylabel('Percentage of missing values (0 to 10%)')
plt.xticks(rotation=90)
plt.yticks(range(0, 11, 1))
plt.tight_layout()
plt.show()

Speed limit has the highest share of missing values at around 2.6%. Everything else is under 0.5% — low enough to leave in place and handle on a chart-by-chart basis.
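The percentages quoted above are simply the per-column mean of `isnull()`. A small sketch on made-up data follows, in case a table is easier to read than the chart; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few gaps (values are illustrative only)
toy = pd.DataFrame({
    "Speed Limit": [100, np.nan, 60, 80],
    "Age": [25, 40, np.nan, np.nan],
    "State": ["NSW", "VIC", "QLD", "WA"],
})

# isnull().mean() gives the share of missing rows per column; scale to percent
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(1))
```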

2.4 Check for Duplicate Entries

Code

# Checking for duplicate rows
duplicate_rows = df[df.duplicated()]
display(duplicate_rows)

# Checking for duplicate values in the Crash ID column
duplicate_crash_ids = df[df.duplicated(['Crash ID'])]
display(duplicate_crash_ids)

	Crash ID	State	Month	Year	Dayweek	Time	Crash Type	Bus Involvement	Articulated Truck Involvement	Speed Limit	Road User	Gender	Age	Christmas Period	Easter Period	Month Name	Age Group	Time of day	Day of week
4760	3199410240286	QLD	10	1994	Monday	10:00:00	Single	Yes	No	100.0	Passenger	Female	72.0	No	No	October	65_to_74	Day	Weekday
5234	1199107070301	NSW	7	1991	Sunday	23:30:00	Single	No	No	60.0	Passenger	Male	18.0	No	No	July	17_to_25	Night	Weekend
5439	6198901110004	TAS	1	1989	Wednesday	20:20:00	Multiple	No	Yes	100.0	Passenger	Male	13.0	No	No	January	0_to_16	Night	Weekday
6650	320201179894	QLD	6	2020	Sunday	04:00:00	Single	No	No	70.0	Passenger	Female	14.0	No	No	June	0_to_16	Night	Weekend
6797	2201111120221	VIC	11	2011	Saturday	12:37:00	Multiple	No	No	100.0	Passenger	Female	20.0	No	No	November	17_to_25	Day	Weekend
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
57433	3200011160243	QLD	11	2000	Thursday	15:00:00	Multiple	No	No	100.0	Passenger	Female	68.0	No	No	November	65_to_74	Day	Weekday
57502	3199903110042	QLD	3	1999	Thursday	21:00:00	Multiple	No	Yes	60.0	Passenger	Female	17.0	No	No	March	17_to_25	Night	Weekday
57746	4201501220007	SA	1	2015	Thursday	16:44:00	Multiple	No	Yes	110.0	Passenger	Male	33.0	No	No	January	26_to_39	Day	Weekday
57805	4201001120005	SA	1	2010	Tuesday	13:20:00	Multiple	No	No	100.0	Passenger	Female	10.0	No	No	January	0_to_16	Day	Weekday
58124	72017150290	NT	2	2017	Saturday	02:26:00	Multiple	No	No	100.0	Pedestrian	Male	15.0	No	No	February	0_to_16	Night	Weekend

166 rows × 19 columns

	Crash ID	State	Month	Year	Dayweek	Time	Crash Type	Bus Involvement	Articulated Truck Involvement	Speed Limit	Road User	Gender	Age	Christmas Period	Easter Period	Month Name	Age Group	Time of day	Day of week
828	1200008150335	NSW	8	2000	Tuesday	14:30:00	Multiple	No	Yes	100.0	Passenger	Female	24.0	No	No	August	17_to_25	Day	Weekday
880	1198911140670	NSW	11	1989	Tuesday	16:10:00	Single	No	No	60.0	Passenger	Female	34.0	No	No	November	26_to_39	Day	Weekday
1053	1199803100082	NSW	3	1998	Tuesday	15:30:00	Multiple	No	No	100.0	Driver	Male	52.0	No	No	March	40_to_64	Day	Weekday
1101	3199307300180	QLD	7	1993	Friday	00:00:00	Single	No	No	60.0	Passenger	Male	44.0	No	No	July	40_to_64	Night	Weekday
1207	4201813570	WA	1	2018	Wednesday	12:14:00	Multiple	No	No	110.0	Driver	Female	49.0	No	No	January	40_to_64	Day	Weekday
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
58139	1198910110598	NSW	10	1989	Wednesday	21:15:00	Single	No	No	100.0	Driver	Female	17.0	No	No	October	17_to_25	Night	Weekday
58145	4201913952	WA	5	2019	Thursday	00:29:00	Single	No	No	NaN	Pedestrian	Female	23.0	No	No	May	17_to_25	Night	Weekday
58161	3201105130075	QLD	5	2011	Friday	23:00:00	Single	No	No	60.0	Driver	Male	27.0	No	No	May	26_to_39	Night	Weekend
58172	5200910120120	WA	10	2009	Monday	14:30:00	Single	No	No	110.0	Passenger	Female	8.0	No	No	October	0_to_16	Day	Weekday
58175	3199306170130	QLD	6	1993	Thursday	09:00:00	Multiple	No	No	60.0	Passenger	Male	9.0	No	No	June	0_to_16	Day	Weekday

5676 rows × 19 columns

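In this dataset, each Crash ID represents a single event, and a crash with multiple fatalities has one row per person killed, which is why 5,676 rows repeat an earlier Crash ID. The 166 fully identical rows are most plausibly victims of the same crash who share the same recorded age, gender, and road user type; the dataset has no person-level identifier to tell them apart. Since there is no clear evidence of true duplicates, we will not drop any rows from the dataset.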
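To review the fully identical rows, it can help to list the crashes they belong to alongside the total number of rows recorded for each of those crashes. The sketch below uses invented rows and is a suggested check, not part of the original analysis.

```python
import pandas as pd

# Invented rows: crash 1 has three victims, two with identical recorded attributes
toy = pd.DataFrame({
    "Crash ID": [1, 1, 1, 2, 3, 3],
    "Road User": ["Passenger", "Passenger", "Driver", "Driver", "Driver", "Passenger"],
    "Gender": ["Male", "Male", "Female", "Female", "Male", "Male"],
    "Age": [18, 18, 40, 33, 52, 9],
})

# keep=False flags every member of a group of identical rows, not just the repeats
identical = toy[toy.duplicated(keep=False)]

# Summarise: identical rows per crash, next to the total rows for that crash
review = identical.groupby("Crash ID").size().rename("Identical Rows").to_frame()
review["Rows in Crash"] = toy.groupby("Crash ID").size().reindex(review.index)
print(review)
```

Crashes where the identical rows make up the entire record for the event may deserve a closer manual look than those where they sit among other, distinct victims.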

2.5 Checking for Outliers

In this dataset we have a single continuous variable of interest, age. We will use a boxplot to check for outliers in this column.

Code

# Creating a boxplot to check for outliers in the Age column
plt.figure(figsize=(8,3))
sns.boxplot(x=df['Age'])
plt.title('Boxplot of Age')
plt.xlabel('Age')
plt.show()

No outliers visible. The IQR check below confirms this.

Code

# Function to calculate the interquartile range (IQR) and the 1.5 x IQR outlier bounds
def iqr_and_bounds(column):
    q1 = column.quantile(0.25)
    q3 = column.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    return iqr, lower_bound, upper_bound

# Calculate the IQR and bounds for the Age column
iqr_value, lower_bound, upper_bound = iqr_and_bounds(df['Age'])

print(f"IQR: {iqr_value}, Lower Bound: {lower_bound}, Upper Bound: {upper_bound}")

# Return the values that fall outside the IQR bounds
def identify_outliers(column):
    _, lower_bound, upper_bound = iqr_and_bounds(column)
    outliers = column[(column < lower_bound) | (column > upper_bound)]
    return outliers

outliers = identify_outliers(df['Age'])
print(f"Outliers: {outliers.values}")

IQR: 34.0, Lower Bound: -29.0, Upper Bound: 107.0
Outliers: []

Part 3: Data Visualisation

This section explores patterns in fatal road transport accidents in Australia from 1989 to 2025. Each row in the dataset represents a single fatality and includes demographic, geographic, and temporal details.

The visualisations aim to examine:

The demographic characteristics of individuals involved in fatal crashes
The geographic distribution of fatal incidents across Australian states and territories
Temporal trends, including changes over time and differences by day of week or time of year
Shifts in the demographic profile of fatalities over time
Trends in pedestrian and cyclist fatalities, and whether these have changed meaningfully since 1989

These visual insights provide context for evaluating the impact of road safety initiatives and identifying groups most at risk.

3.1 Demographic Analysis of Fatal Road Accidents in Australia

Code

# Age distribution of fatalities in Australia
sns.set_style('whitegrid')
plt.figure(figsize=(16,8))
sns.histplot(df['Age'], bins=40,
            kde=False,
            alpha=0.9)
plt.title(f"Age Distribution of Fatal Road Transport Accidents in Australia ({earliest_year} to {latest_year})")
plt.xlabel('Age')
plt.xticks(range(0, 101, 10))
plt.ylabel('Count')
plt.show()

Fatalities peak sharply in the late teens and early 20s (18 is the single most common age in the summary table), then taper off steadily with age.

Code

# Plot of fatal accidents by gender

sns.set_style('whitegrid')
plt.figure(figsize=(4,6))
sns.countplot(x='Gender', 
            data=df,
            alpha=0.9)
plt.title(f"Gender Distribution of Fatal Accidents in Australia by\nCount ({earliest_year} to {latest_year})")
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()

Males make up the clear majority — the chart below expresses this as a percentage.

Code

# Gender distribution of fatalities as a percentage
gender_percentage = (df['Gender'].value_counts(normalize=True) * 100).reset_index()
gender_percentage.columns = ['Gender', 'Percentage']

# Plotting
plt.figure(figsize=(4,6))
sns.set_style('whitegrid')
sns.barplot(x='Gender', 
            y='Percentage', 
            data=gender_percentage, 
            alpha=0.9)
plt.title(f"Gender Distribution of Fatal Accidents in Australia\nas a Percentage ({earliest_year} to {latest_year})")
plt.xlabel('Gender')
plt.ylabel('Percentage')
plt.show()

Around 70% of road fatalities are male — a ratio that, as we’ll see later, has stayed remarkably consistent over time.

Code

# Density plot of age and gender
plt.figure(figsize=(16,8))
sns.set_style('whitegrid')
sns.kdeplot(data=df, 
            x='Age', 
            hue='Gender', 
            fill=True, 
            palette='rainbow', 
            alpha=0.5)
plt.title(f"Age and Gender Distribution of Fatal Accidents in Australia ({earliest_year} to {latest_year})")
plt.xlabel('Age')
plt.ylabel('Density')
plt.show()

Both genders peak around age 20 and follow a broadly similar distribution. The difference is largely one of scale, not shape.

Code

# Box plot of age and gender distribution
plt.figure(figsize=(4,6))
sns.set_style('whitegrid')
sns.boxplot(x='Gender', 
            y='Age', 
            data=df)
plt.title(f"Box Plot of Age and Gender Distribution of Fatal Accidents\nin Australia ({earliest_year} to {latest_year})")
plt.xlabel('Gender')
plt.ylabel('Age')
plt.show()

Male fatalities skew slightly older at the median, with a handful of outliers at the upper end.

Code

# Calculate the number of accidents by road user
road_user_counts = df['Road User'].value_counts().reset_index()
road_user_counts.columns = ['Road User', 'Accidents']

# Fatalities by Road User Type
sns.set_style('whitegrid')
plt.figure(figsize=(12,6))
sns.barplot(x='Accidents', 
            y='Road User', 
            data=road_user_counts, 
            hue='Road User', 
            palette='rainbow')
plt.title(f"Number of Fatal Accidents by Road User Type ({earliest_year} to {latest_year})")
plt.xlabel('Number of Accidents')
plt.ylabel('Road User Type')
plt.tight_layout()
plt.show()

Drivers and passengers make up the bulk of fatalities — unsurprisingly, given how many more cars there are on the road than anything else. Pedestrians and motorcyclists follow, and both are examined in more detail in the sub-analyses.

Code

# Calculate the number of fatalities by year and age group
age_group_fatalities = df.groupby(['Year', 'Age Group'], observed=True)['Crash ID'].size().reset_index(name='Fatalities')

# Using a pivot table to transform the data into wide format
age_group_fatalities_pivot = age_group_fatalities.pivot(index='Year', 
                                                        columns='Age Group', 
                                                        values='Fatalities')
age_group_fatalities_pivot.columns = age_group_fatalities_pivot.columns.str.replace('_', ' ') # Remove underscores from column names so they appear more neatly in the legend

# Plotting
sns.set_style('whitegrid')
age_group_fatalities_pivot.plot(kind='bar', 
                                stacked=True, 
                                figsize=(12, 6), 
                                cmap='rainbow')
plt.title(f"Fatalities by Age Group ({earliest_year} to {latest_year})")
plt.xticks(rotation=45)
plt.xlabel('Year')
plt.ylabel('Fatalities')
plt.legend(title='Age Group')
plt.show()

Fatalities have fallen across all age groups since 1989. The mix has not stayed fixed, though: the 17–25 group's count has fallen faster than the others, so its share of the total has shrunk.

Code

# Calculate the number of fatalities by year and road user
road_user_fatalities = df.groupby(['Year', 'Road User'])['Crash ID'].size().reset_index(name='Fatalities')

# Using a pivot table to transform the data into wide format
road_user_fatalities_pivot = road_user_fatalities.pivot(index='Year', 
                                                        columns='Road User', 
                                                        values='Fatalities')

# Plotting
sns.set_style('whitegrid')
road_user_fatalities_pivot.plot(kind='bar', 
                                stacked=True, 
                                figsize=(12, 6), 
                                cmap='jet')
plt.title(f"Fatalities by Road User Type ({earliest_year} to {latest_year})")
plt.xlabel('Year')
plt.ylabel('Fatalities')
plt.xticks(rotation=45)
plt.legend(title='Road User Type')
plt.show()

The same broad decline holds across all road user types, with no major shifts in the relative mix over time.

3.2 Geographic Analysis of Fatal Road Accidents in Australia

Code

# Pie chart of fatal accidents by State
# Firstly, calculate the number of accidents by state. 
state_counts = df['State'].value_counts().reset_index()
state_counts.columns = ['State', 'Accidents']

plt.figure(figsize=(10,10))
plt.pie(x=state_counts['Accidents'], 
        labels=state_counts['State'], 
        autopct='%1.1f%%', 
        startangle=90)
plt.title(f"Percentage of Fatal Accidents by State ({earliest_year} to {latest_year})")
plt.show()

NSW, Victoria, and Queensland together account for around 80% of fatalities — broadly in line with their share of the national population.
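The combined share of the three largest states can be computed directly from the row counts rather than read off the pie chart. A minimal sketch on made-up rows follows; the state mix is illustrative and does not reproduce the ARDD figures.

```python
import pandas as pd

# Made-up fatality rows (state mix is illustrative only)
toy = pd.DataFrame({"State": ["NSW"] * 4 + ["VIC"] * 3 + ["QLD"] * 2 + ["WA"]})

# Share of rows per state as a percentage, largest first
state_share = toy["State"].value_counts(normalize=True) * 100

# Combined share of the three largest states
top_three_share = state_share.head(3).sum()
print(f"Top three states: {top_three_share:.1f}%")
```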

Code

# Bar chart of fatal accidents by State
sns.set_style('whitegrid')
plt.figure(figsize=(12,6))
sns.barplot(x='Accidents', 
            y='State', 
            data=state_counts,
            hue='State',
            alpha=0.9,
            palette='rainbow')
plt.title(f"Number of Fatal Accidents by State ({earliest_year} to {latest_year})")
plt.xlabel('Number of Accidents')
plt.ylabel('State')
plt.show()

Same data, different format — easier to compare state counts directly.

3.3 Temporal Analysis of Fatal Road Accidents in Australia

Accidents by Month of Year

Code

# Now we will calculate the number of accidents by month

month_counts = df['Month Name'].value_counts().reset_index()
month_counts.columns = ['Month', 'Accidents']

# Sorting the months in order by converting them to a categorical variable and sorting by the month order

month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
            'August', 'September', 'October', 'November', 'December']
month_counts['Month'] = pd.Categorical(month_counts['Month'], categories=month_order, ordered=True)
month_counts = month_counts.sort_values('Month')

Code

# Plot the number of accidents by month

sns.set_style('whitegrid')
plt.figure(figsize=(12,6))
sns.barplot(x='Month', 
            y='Accidents', 
            data=month_counts, 
            hue='Month', 
            palette='rainbow',
            alpha=0.9)
plt.title(f"Number of Fatal Accidents by Month ({earliest_year} to {latest_year})")
plt.xlabel('Month')
plt.ylabel('Number of Accidents')
plt.xticks(rotation=45)
plt.yticks(range(0, 5500, 500))
plt.show()

Fatalities are fairly evenly distributed throughout the year, with a modest uptick in December and March. Both months plausibly coincide with heavier holiday travel (Christmas, and Easter in some years), though this dataset alone cannot confirm the cause.

Code

# Count accidents by day of the week
day_counts = df['Dayweek'].value_counts().reset_index()
day_counts.columns = ['Day', 'Accidents']

# Sort the days in order
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_counts['Day'] = pd.Categorical(day_counts['Day'], categories=day_order, ordered=True)
day_counts = day_counts.sort_values('Day')

Code

# Plot the number of accidents by day of the week

sns.set_style('whitegrid')
plt.figure(figsize=(12,6))
sns.barplot(x='Day', 
            y='Accidents', 
            data=day_counts, 
            hue='Day', 
            palette='rainbow',
            alpha=0.9)
plt.title(f"Number of Fatal Accidents by Day of the Week ({earliest_year} to {latest_year})")
plt.xlabel('Day of the Week')
plt.yticks(range(0, 12000, 1000))
plt.ylabel('Number of Accidents')
plt.xticks(rotation=45)
plt.show()

Saturday records the most fatalities, followed by Friday and Sunday. The weekend effect is clear and consistent.
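The strength of the weekend effect can be stated as a single number: the share of fatalities falling on Friday, Saturday, or Sunday. Three of seven days would give about 42.9% if every day were equal. The sketch below uses made-up rows, and grouping Friday with the weekend is an assumption made here for illustration.

```python
import pandas as pd

# Made-up day-of-week values, one per fatality row
toy = pd.DataFrame({"Dayweek": ["Friday", "Saturday", "Saturday", "Sunday",
                                "Monday", "Tuesday", "Wednesday", "Thursday"]})

weekend_days = ["Friday", "Saturday", "Sunday"]

# isin() gives a boolean per row; its mean is the share of rows on those days
weekend_share = toy["Dayweek"].isin(weekend_days).mean() * 100
print(f"Friday to Sunday share: {weekend_share:.1f}%")
```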

Code

# Convert the day-of-week counts to a percentage of all fatalities
day_counts['Percentage'] = (day_counts['Accidents'] / day_counts['Accidents'].sum()) * 100
sns.set_style('whitegrid')
plt.figure(figsize=(12,6))
sns.barplot(x='Day', 
            y='Percentage', 
            data=day_counts, 
            hue='Day', 
            palette='rainbow')
plt.title(f"Percentage of Fatal Accidents by Day of the Week ({earliest_year} to {latest_year})")
plt.xlabel('Day of the Week')
plt.ylabel('Percentage of Accidents (%)')
plt.xticks(rotation=45)
plt.show()

A day-of-week and hour-of-day heatmap makes the pattern much clearer.

Code

# We will create a new data frame for this visualisation because we have to do some manipulation that involves dropping rows. 

# Creating a new dataframe
heatmap_df = df[['Time', 'Dayweek']].copy()

# Dropping rows with missing values
heatmap_df = heatmap_df.dropna(subset=['Time'])

# Then we will extract the hour from the time field to make it easier to create a heatmap
heatmap_df['Hour'] = heatmap_df['Time'].str.split(':').str[0].astype(int)

# Next we need to create a pivot table to convert the data into wide format
pivot_table = pd.pivot_table(heatmap_df, 
                            values='Time', 
                            index=['Hour'], 
                            columns=['Dayweek'], 
                            aggfunc='count', 
                            fill_value=0)

# Finally plotting a heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(pivot_table[day_order], 
            annot=True, 
            cmap='RdYlGn_r', # Take a colour palette from https://loading.io/color/feature/RdYlGn-9/ and use _r to flip it so that red is higher and green is lower
            fmt='g')
plt.title(f"Heatmap of Fatal Accidents ({earliest_year} to {latest_year}) by Day of Week and Hour of Day")
plt.xlabel('Day of the Week')
plt.ylabel('Hour of Day')
plt.show()

Friday and Saturday afternoons stand out clearly — the 3pm Friday slot in particular. Late-night weekend hours also see elevated numbers.
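The busiest cell can also be pulled out of the pivot table programmatically instead of by eye. The sketch below uses a made-up table in the same wide layout (hours as rows, days as columns); the counts are illustrative only.

```python
import pandas as pd

# Made-up hour/day counts in the same wide layout as the heatmap pivot
toy_pivot = pd.DataFrame(
    {"Friday": [5, 12, 30], "Saturday": [8, 25, 22]},
    index=pd.Index([0, 12, 15], name="Hour"),
)

# unstack() turns the wide table into a Series indexed by (Day, Hour)
long_counts = toy_pivot.unstack()

# idxmax() returns the (Day, Hour) label of the largest count
peak_day, peak_hour = long_counts.idxmax()
print(f"Peak slot: {peak_day} at {peak_hour}:00 with {long_counts.max()} fatalities")
```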

Change over time in the number of fatal accidents by year

Code

# Creating a line graph of fatalities by year
fatalities_per_year = df.groupby('Year')['Crash ID'].size().reset_index()
fatalities_per_year.columns = ['Year', 'Fatalities']

# Creating a Line Plot
sns.set_style('whitegrid')
plt.figure(figsize=(12, 6))
sns.lineplot(x='Year', 
            y='Fatalities',
            data=fatalities_per_year,
            marker="o")
plt.title(f"Total Number of Fatalities by Year ({earliest_year} to {latest_year})")
plt.xlabel('Year')
plt.ylabel('Number of Fatalities')
plt.grid(True)
plt.show()

Road fatalities have declined substantially since 1989, but have risen from their historic low of around 1,100 in 2020–21 to levels not seen since 2016. Since Australia’s population has grown considerably over this period, raw counts alone can be misleading — a population-adjusted measure gives a more accurate picture of road safety progress.

Code

# Population data for Australia comes from the United Nations data portal: https://population.un.org/dataportal/data/indicators/49/locations/36/start/1989/end/2021/table/pivotbylocation
# The lists below hold one population value per year from 1989 to 2025, stored in a dictionary with 'Year' and 'Population' keys.

australian_population_data =  {
    'Year': list(range(1989, 2026)),
    'Population': [
        16796588, 17048003, 17271086, 17462504, 17631511, 17805504, 18003000,
        18211845, 18410250, 18601667, 18800892, 19017963, 19248143, 19475844,
        19698999, 19925056, 20171731, 20467030, 20830828, 21247873, 21660892,
        22019168, 22357034, 22729269, 23111782, 23469579, 23820236, 24195701,
        24590334, 24979230, 25357170, 25670051, 25921089, 26200985, 26451124,
        26713205, 26974026
    ]
}

# Creating a dataframe from the dictionary
population_df = pd.DataFrame(australian_population_data)

# Merge the population data with the fatalities data
merged_df = pd.merge(fatalities_per_year, population_df, on='Year', how='inner')

# Calculate the number of fatalities per 100,000 people
merged_df['Fatalities per 100k'] = (merged_df['Fatalities'] / merged_df['Population']) * 100000

# Line plot of fatalities per 100,000 people
sns.set_style('whitegrid')
plt.figure(figsize=(12, 6))
sns.lineplot(x='Year', 
            y='Fatalities per 100k', 
            data=merged_df,
            marker="o")
plt.title(f"Fatalities per 100,000 Population by Year ({earliest_year} to {latest_year})")
plt.xlabel('Year')
plt.ylabel('Fatalities per 100,000 Population')
plt.grid(True)
plt.show()

When adjusted for population growth, the picture is more nuanced. Fatalities per 100,000 people remain only slightly above the pre-pandemic low, reflecting the fact that the recent rise in raw numbers is partly explained by a larger population base. Population-adjusted figures are generally a more meaningful measure of road safety progress, and on that basis the recent increase, while worth monitoring, is considerably less dramatic than the raw numbers suggest.
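As a worked check of the per-100,000 calculation, take 1989: the summary table in Part 1 shows 2,800 fatality rows for that year, and the first entry in the population list is 16,796,588.

```python
# Worked example of the rate formula for a single year (1989)
fatalities_1989 = 2800          # count of 1989 rows from the Part 1 summary table
population_1989 = 16_796_588    # first entry in the population list

# Rate per 100,000 = fatalities / population * 100,000
rate_1989 = fatalities_1989 / population_1989 * 100_000
print(f"1989: {rate_1989:.2f} fatalities per 100,000 people")
```

The same formula is applied to every year in `merged_df`, so any single year can be checked by hand in this way.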

Fatalities by gender over time

Code

# Grouping the number of fatalities each year by gender
gender_fatalities = df.groupby(['Year', 'Gender'])['Crash ID'].size().reset_index(name='Fatalities')

# Calculating total fatalities per year
total_fatalities_per_year = gender_fatalities.groupby('Year')['Fatalities'].sum().reset_index(name='Total Fatalities')

# Merging the dataframes and calculating proportions
gender_fatalities = pd.merge(gender_fatalities, total_fatalities_per_year, on='Year')
gender_fatalities['Proportion'] = (gender_fatalities['Fatalities'] / gender_fatalities['Total Fatalities']) * 100

# Pivoting the data for easier plotting
gender_proportions_pivot = gender_fatalities.pivot(index='Year', columns='Gender', values='Proportion')

# Plotting
sns.set_style('whitegrid')
plt.figure(figsize=(12, 6))
sns.lineplot(data=gender_proportions_pivot)
plt.title(f"Proportion of Fatalities by Gender Over Time ({earliest_year} to {latest_year})")
plt.xlabel('Year')
plt.ylabel('Proportion of Fatalities (%)')
plt.grid(color='grey', linestyle='--', linewidth=0.5, alpha=0.5)
plt.legend(title='Gender')
plt.show()

The male-to-female ratio has barely shifted since 1989. This is a persistent structural pattern, not a recent phenomenon.
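The yearly shares can also be computed in one step with `pd.crosstab`, which is a convenient cross-check on the groupby-and-merge approach. The sketch below uses a tiny made-up DataFrame (`toy_df`) in place of the ARDD data; only the `Year` and `Gender` column names follow the notebook.

```python
import pandas as pd

# HYPOTHETICAL toy records standing in for ARDD rows (one row per fatality).
toy_df = pd.DataFrame({
    "Year":   [1989, 1989, 1989, 1989, 2024, 2024, 2024],
    "Gender": ["Male", "Male", "Male", "Female", "Male", "Male", "Female"],
})

# normalize="index" divides each count by its row (year) total,
# so multiplying by 100 gives each gender's percentage share per year.
gender_share = pd.crosstab(toy_df["Year"], toy_df["Gender"], normalize="index") * 100

# Sanity check: each year's shares should sum to 100%.
row_sums = gender_share.sum(axis=1)

print(gender_share.round(1))
```

If the shares from this approach and from the merged table disagree on the real data, missing or unexpected `Gender` values are a likely place to look.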

Fatalities by age group over time

Code

# Plotting
sns.set_style('whitegrid')
plt.figure(figsize=(12, 6))
sns.lineplot(data=age_group_fatalities_pivot, marker="o", linewidth=2.5) # Use the pivot table created earlier
plt.title(f"Fatalities Over Time by Age Group ({earliest_year} to {latest_year})")
plt.xlabel('Year')
plt.ylabel('Number of Fatalities')
plt.grid(color='gray', linestyle='--', linewidth=0.5)
plt.legend(title='Age Group')
plt.show()

Every age group has seen a decline, but the 17–25 cohort stands out: its drop is steeper than any other group's, likely reflecting graduated licensing reforms introduced across Australian states in the 1990s and 2000s. Over the same period, 40–64-year-olds overtook the 17–25 cohort as the age group with the highest annual fatality count.
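One way to put numbers on "steeper" and "overtook" is to compute each group's percentage change between the first and last year, and to find which group leads in each year. The sketch below assumes a pivot shaped like `age_group_fatalities_pivot` (years as the index, age groups as columns); the counts and the two column labels are hypothetical.

```python
import pandas as pd

# HYPOTHETICAL counts shaped like a Year x Age Group pivot (not ARDD values).
toy_pivot = pd.DataFrame(
    {"17-25": [900, 600, 300], "40-64": [500, 450, 400]},
    index=pd.Index([1989, 2005, 2024], name="Year"),
)

# Percentage change from the first to the last year, most negative first.
first, last = toy_pivot.iloc[0], toy_pivot.iloc[-1]
pct_change = ((last - first) / first * 100).sort_values()

# For each year, the age group with the highest fatality count.
leading_group = toy_pivot.idxmax(axis=1)

print(pct_change.round(1))
print(leading_group)
```

Applied to the real pivot, the first year in which `leading_group` switches would date the crossover more precisely than reading it off the chart.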

Fatalities by road user type over time

Code

# Filter the data to pedestrian and pedal cyclist fatalities
ped_and_cyclist_fatalities = df[df['Road User'].isin(['Pedal cyclist', 'Pedestrian'])]

# Group data by Year and Road User, counting all Crash ID occurrences
fatalities_by_year_user = ped_and_cyclist_fatalities.groupby(['Year', 'Road User'])['Crash ID'].count().reset_index(name='Fatalities')

# Merge fatalities data with population data
fatalities_with_pop = pd.merge(fatalities_by_year_user, population_df, on='Year', how='left')

# Calculate fatalities per 100,000 people
fatalities_with_pop['Fatalities per 100k'] = (fatalities_with_pop['Fatalities'] / fatalities_with_pop['Population']) * 100000

# Line plot of fatalities per 100,000 people by road user type
sns.set_style('whitegrid')
plt.figure(figsize=(12, 6))
sns.lineplot(x='Year', 
            y='Fatalities per 100k',
            data=fatalities_with_pop, 
            hue='Road User', 
            marker="o")
plt.title(f"Fatalities per 100,000 Population Over Time by Road User Type ({earliest_year} to {latest_year})")
plt.xlabel('Year')
plt.ylabel('Fatalities per 100,000 Population')
plt.grid(True)
sns.despine()
plt.tight_layout()
plt.show()

Per 100,000 population, pedestrian fatalities have fallen steadily since 1989, while cyclist deaths have barely moved since the mid-1990s. That gap is explored further in the vulnerable road users notebook.
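A simple way to test a claim like "barely moved" is to fit a straight line to each series and compare the slopes. The sketch below does this with `numpy.polyfit` on made-up rates; the values and the helper name `trend_per_year` are hypothetical, and a linear fit is only a rough summary of a trend.

```python
import numpy as np

# HYPOTHETICAL rates per 100,000, for illustration only (not ARDD values).
years = np.arange(1995, 2005)
pedestrian_rate = np.array([2.0, 1.9, 1.8, 1.7, 1.6, 1.5, 1.4, 1.3, 1.2, 1.1])
cyclist_rate = np.array([0.20, 0.21, 0.19, 0.20, 0.21, 0.19, 0.20, 0.21, 0.19, 0.20])

def trend_per_year(x, y):
    """Least-squares slope of y against x (change in rate per year)."""
    # Centre the years first; this keeps the fit numerically well behaved.
    slope, _intercept = np.polyfit(x - x.mean(), y, deg=1)
    return slope

ped_slope = trend_per_year(years, pedestrian_rate)
cyc_slope = trend_per_year(years, cyclist_rate)

print(f"Pedestrian trend: {ped_slope:+.4f} per 100k per year")
print(f"Cyclist trend:    {cyc_slope:+.4f} per 100k per year")
```

On the real data, the same function could be applied to each `Road User` group in `fatalities_with_pop`, restricted to the years of interest.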

Part 4: Conclusion

This analysis mapped how fatal road crashes in Australia have changed from 1989 to 2025 across people, places, and time.

Key findings include:

- A significant overall decline in fatalities, both in absolute terms and per 100,000 population.
- The largest reduction occurred among 17–25-year-old drivers, potentially linked to licensing reforms targeting provisional licence holders.
- Males consistently account for approximately 70% of road fatalities throughout the observed period.
- Fatal accidents are most concentrated between Friday and Sunday, with Friday afternoons posing the highest risk.
- New South Wales, Victoria, and Queensland collectively represent over 80% of national road fatalities.
- Pedestrian fatalities have decreased over time, while cyclist fatalities have remained relatively stable since the mid-1990s.

Opportunities for Further Research

This dataset covers fatal crashes only. Adding non-fatal injury data would give a fuller view of road harm, not just the most severe outcomes.
It would also be useful to examine injury-to-fatality ratios for pedestrians and cyclists to better understand where safety gains are happening and where they are stalling.