Classified: At-Risk – Salifort Churn Prediction


Exploratory Data Analysis


Understanding employee turnover at Salifort Motors

Bryan Johns

June 2025

Table of Contents

  • Description and Deliverables
  • Data Dictionary
  • Descriptive Statistics
    • Outliers
  • Exploratory Data Analysis
    • Overview Plots
    • Satisfaction Levels
    • Tenure
    • Number of Projects
    • Salary & Promotions
    • Work Accident
    • Department
    • Correlation Matrix
  • Insights

Description and Deliverables

The hypothetical HR department at the fictional Salifort Motors collected employee data to improve satisfaction. They requested data-driven suggestions based on an analysis of this data. The main question is: what factors are likely to make an employee leave the company?

The goal of this project is to analyze the data and build a model to predict employee attrition. By identifying which employees are likely to leave, it may be possible to determine the factors contributing to their departure. The model should be interpretable so HR can design targeted interventions to improve retention. Improving retention can reduce the costs associated with hiring and training new employees.

Stakeholders:
The primary stakeholder is the Human Resources (HR) department, as they will use the results to inform retention strategies. Secondary stakeholders include C-suite executives who oversee company direction, managers implementing day-to-day retention efforts, employees (whose experiences and outcomes are directly affected), and, indirectly, customers, since employee satisfaction can impact customer satisfaction.

Ethical Considerations:

  • Ensure employee data privacy and confidentiality throughout the analysis.
  • Avoid introducing or perpetuating bias in model predictions (e.g., not unfairly targeting specific groups).
  • Maintain transparency in how predictions are generated and how they will be used in HR decision-making.

This page summarizes the first part of the project: exploratory data analysis.

Data Dictionary

The dataset contains 14,999 rows and 10 columns for the variables listed below.

Note: For more information about the data, refer to its source on Kaggle.

Variable Description
satisfaction_level Employee-reported job satisfaction level [0–1]
last_evaluation Score of employee's last performance review [0–1]
number_project Number of projects employee contributes to
average_monthly_hours Average number of hours employee worked per month
time_spend_company How long the employee has been with the company (years)
Work_accident Whether or not the employee experienced an accident while at work
left Whether or not the employee left the company
promotion_last_5years Whether or not the employee was promoted in the last 5 years
Department The employee's department
salary The employee's salary band (low, medium, or high)

Descriptive Statistics

Initial Data Observations:

  • The workforce displays moderate satisfaction and generally high performance reviews.
  • Typical tenure is 3–4 years, with most employees (98%) not promoted recently.
  • Workplace accidents are relatively rare (14%).
  • Most employees are in lower salary bands and concentrated in sales, technical, and support roles.
  • About 24% of employees in the raw data have left the company (16.6% once duplicate rows are removed).
  • No extreme outliers, though a few employees have unusually long tenures or high monthly hours.

Data Wrangling

During initial data exploration, several basic data cleaning steps were taken. Columns were renamed to standardized snake_case format for consistency and easier coding. I confirmed there were no missing values, reducing the risk of bias or errors. Outliers were explored but not removed at this stage; they will be addressed as needed during modeling.

Most importantly, there were 3,008 duplicate rows in the dataset. Since it is highly improbable for two employees to have identical responses across all columns, these duplicate entries were removed from the analysis.
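The cleaning steps described above can be sketched in pandas. This is a minimal illustration, not the project's actual code: the function name `clean_hr_data` and the specific renames (fixing the `average_montly_hours` typo, shortening `time_spend_company` to `tenure`) are assumptions made for the example.

```python
import pandas as pd

def clean_hr_data(raw):
    """Standardize column names, count missing values, and drop duplicate rows."""
    # Lowercase names such as "Work_accident" and "Department".
    df = raw.rename(columns=lambda c: c.strip().lower())
    # Hypothetical renames: fix the source typo and use a shorter tenure name.
    df = df.rename(columns={
        "average_montly_hours": "average_monthly_hours",
        "time_spend_company": "tenure",
    })
    n_missing = int(df.isna().sum().sum())   # total missing cells
    n_duplicates = int(df.duplicated().sum())  # rows identical to an earlier row
    df = df.drop_duplicates(keep="first").reset_index(drop=True)
    return df, n_missing, n_duplicates

# Tiny illustrative frame (not the real data): the second row repeats the first.
raw = pd.DataFrame({
    "Work_accident": [0, 0, 1],
    "average_montly_hours": [157, 157, 262],
    "time_spend_company": [3, 3, 6],
})
clean, n_missing, n_duplicates = clean_hr_data(raw)
```

On the real dataset, `n_duplicates` would be expected to match the 3,008 duplicates reported above, leaving 11,991 rows.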

Start with a look at the first few rows of data:

satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years Department salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low

Below are descriptive statistics for numerical data, followed by categorical data (department and salary) and summary observations.

satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
count 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000
mean 0.612834 0.716102 3.803054 201.050337 3.498233 0.144610 0.238083 0.021268
std 0.248631 0.171169 1.232592 49.943099 1.460136 0.351719 0.425924 0.144281
min 0.090000 0.360000 2.000000 96.000000 2.000000 0.000000 0.000000 0.000000
25% 0.440000 0.560000 3.000000 156.000000 3.000000 0.000000 0.000000 0.000000
50% 0.640000 0.720000 4.000000 200.000000 3.000000 0.000000 0.000000 0.000000
75% 0.820000 0.870000 5.000000 245.000000 4.000000 0.000000 0.000000 0.000000
max 1.000000 1.000000 7.000000 310.000000 10.000000 1.000000 1.000000 1.000000

Department value counts and percentages:

Department Count Percent
Sales 4140 27.60
Technical 2720 18.13
Support 2229 14.86
IT 1227 8.18
Product Mng 902 6.01
Marketing 858 5.72
R and D 787 5.25
Accounting 767 5.11
HR 739 4.93
Management 630 4.20

Salary value counts and percentages:

salary Count Percent
Low 7316 48.78
Medium 6446 42.98
High 1237 8.25

Observations from descriptive statistics

  • satisfaction_level: Employee job satisfaction scores range from 0.09 to 1.0, with an average of about 0.61. The distribution is fairly wide (std ≈ 0.25), suggesting a mix of satisfied and dissatisfied employees.
  • last_evaluation: Performance review scores are generally high (mean ≈ 0.72), ranging from 0.36 to 1.0, with three quarters of employees scoring 0.56 or higher.
  • number_project: Employees work on 2 to 7 projects, with a median of 4.
  • average_monthly_hours: The average employee works about 201 hours per month, with a range from 96 to 310 hours, indicating some employees work significantly more than others.
  • time_spend_company: Tenure ranges from 2 to 10 years, with a median of 3 years. A few employees have long tenures (up to 10 years), but most are around 3–4 years.
  • Work_accident: About 14% of employees have experienced a workplace accident.
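  • left: About 24% of employees in the raw data have left the company (mean ≈ 0.24), so roughly one in four rows is a leaver. These statistics were computed before duplicate rows were removed; after deduplication the share is 16.6%.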
  • promotion_last_5years: Very few employees (about 2%) have been promoted in the last five years.
  • department: The largest departments are sales, technical, and support, which together account for over half of the workforce. Other departments are notably smaller.
  • salary: Most employees are in the low (49%) or medium (43%) salary bands, with only a small proportion (8%) in the high salary band.

Summary:
The data shows a workforce with moderate satisfaction, generally high performance reviews, and a typical tenure of 3–4 years. Most employees have not been promoted recently, and workplace accidents are relatively uncommon. Most employees are in lower salary bands and concentrated in sales, technical, and support roles. There is a notable proportion of employees who have left. There are no extreme outliers, but a few employees have unusually long tenures or high monthly hours.

Outliers

Long-term employees (those with more than five years at the company) are statistical outliers in this dataset. These cases will be excluded from logistic regression models during Model Construction & Validation because logistic regression is particularly sensitive to outliers.

Employees with exceptionally high average monthly hours or an unusually high or low number of projects may also be considered outliers. However, these cases are not easily detected in overall summary statistics or aggregate plots, as their impact is masked by the larger population of typical employees. Aggregate statistics can hide important subgroup patterns, which will be explored further during EDA. These subgroups will remain in the data for model construction, as they may provide valuable insights into attrition risk.

Number of tenure outliers: 824

Outliers percentage of total: 6.87%
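The text does not say which rule was used to flag tenure outliers. A minimal sketch, assuming the common 1.5 × IQR fence: with quartiles of 3 and 4 years (as in the descriptive statistics), the upper fence is 5.5, which is consistent with treating employees with more than five years as outliers. The helper name `iqr_bounds` and the sample values are illustrative only.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative tenure values in years (not the real data).
tenure = pd.Series([2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 6, 10])
lower, upper = iqr_bounds(tenure)

# Rows outside the fences are flagged as outliers.
outliers = tenure[(tenure < lower) | (tenure > upper)]
outlier_share = 100 * len(outliers) / len(tenure)
```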

Exploratory Data Analysis

Two major groups of employees left the company:

  • Overworked and Miserable: These employees had low satisfaction but were assigned a high number of projects (6–7) and worked 250–300 hours per month. Notably, 100% of employees with 7 projects left.

  • Underworked and Dissatisfied: These employees had low satisfaction and worked fewer hours and projects. They may have been fired. Alternatively, they may have given notice, or had already mentally checked out, and were assigned less work.

Employees working on 3–4 projects generally stayed. Most groups worked more than a typical 40-hour workweek.

Attrition is highest at the 4–5 year mark, with a sharp drop-off in departures after 5 years. This suggests a critical window for retention efforts. Employees who make it past 5 years are much more likely to stay.

Both leavers and stayers tend to have similar evaluation scores. Notably, some employees with high evaluations still leave, often those who are overworked. This suggests that strong performance alone does not guarantee retention if other factors (like satisfaction or workload) are problematic.

Relationships Between Variables:

  • Satisfaction level is the strongest predictor of attrition. Employees who left had much lower satisfaction than those who stayed.
  • Number of projects and average monthly hours show a non-linear relationship: both underworked and overworked employees are more likely to leave, while those with a moderate workload tend to stay.
  • Employee evaluation (last performance review) has a weaker relationship with attrition compared to satisfaction or workload.
  • Tenure shows a moderate relationship with attrition: employees are most likely to leave at the 4–5 year mark, with departures dropping sharply after 5 years.
  • Promotion in the last 5 years is rare, and lack of promotion is associated with higher attrition.
  • Department and salary have only minor effects on attrition compared to satisfaction and workload.
  • Work accidents are slightly associated with lower attrition, possibly due to increased support after an incident.

Distributions in the Data:

  • Most variables (satisfaction, evaluation, monthly hours) are broadly distributed, with some skewness.
  • Tenure is concentrated around 3–4 years, with few employees beyond 5 years.
  • Number of projects is typically 3–4, but a small group has 6–7 projects (most of whom left).
  • Salary is heavily skewed toward low and medium bands.
  • There are no extreme outliers, but a few employees have unusually high tenure or monthly hours.

Ethical Considerations:

  • Ensure employee data privacy and confidentiality.
  • Avoid introducing or perpetuating bias in analysis or modeling.
  • Be transparent about how findings and predictions will be used.
  • Consider the impact of recommendations on employee well-being and fairness.

Note:
This data is clearly synthetic. It's too clean, and the clusters in the charts are much neater than what you’d see in real-world HR data.

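Before starting, look at the distribution of employees who left versus those who stayed. After duplicates are removed, about 17% of the remaining 11,991 employees left, down from 24% in the raw data. An attrition rate of this size is fairly typical for industry, but the class imbalance matters when training a predictive model.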

Count Percent
Stayed 10000 83.40
Left 1991 16.60
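A count-and-percent table like the one above can be produced with `value_counts`. This is a sketch under assumptions: the helper name `class_balance` is hypothetical, and the series is rebuilt from the reported counts (10,000 stayed, 1,991 left) rather than loaded from the data.

```python
import pandas as pd

def class_balance(target):
    """Return counts and percentages for each class of a binary target."""
    counts = target.value_counts()
    percents = (target.value_counts(normalize=True) * 100).round(2)
    return pd.DataFrame({"Count": counts, "Percent": percents})

# Rebuild the target from the reported counts: 0 = stayed, 1 = left.
left = pd.Series([0] * 10000 + [1] * 1991, name="left")
balance = class_balance(left)
```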

Overview Plots

Visualizing Feature Distributions and Relationships by Employee Attrition

These are overview plots that provide a broad look at the data. After these, we’ll focus on individual features in more detail. The goal here is to give an initial sense of the dataset’s structure and key patterns.

The following pairplots show the relationships between features, with the diagonal displaying each feature’s distribution.

Boxplots summarize the overall distribution of each feature. However, as noted earlier, aggregate plots can sometimes hide important subgroups or outliers.

Violin plots are especially useful here, as they reveal the presence of distinct subgroups. For example, in satisfaction_level, you can see the extremely miserable and somewhat dissatisfied employees, along with those who left for more typical reasons. In last_evaluation and average_monthly_hours, employees who left cluster at both extremes, while those who stayed are more evenly distributed. For number_project, leavers are concentrated at both the low and high ends, and for tenure, there is a noticeable spike in departures around the 4–5 year mark.

Finally, we include histograms for each feature, first normalized, to compare proportions between leavers and stayers, and then as raw counts.

Satisfaction Levels

These visualizations dramatically illustrate the two main clusters of employees who left.

There are two prominent clusters among employees who left: one group with very low satisfaction who worked long hours, and another group who worked fewer than 40 hours per week and reported moderate dissatisfaction. All of the employees who worked the longest hours left.

A similar two-cluster pattern appears in the evaluation scores of employees who left: those with very low satisfaction often received high evaluations, while those with moderate dissatisfaction tended to have relatively low evaluation scores.

Mean Median
Stayed 0.667 0.690
Left 0.440 0.410

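Above: On the 0–1 satisfaction scale, employees who left scored 0.227 points lower on average (mean) and 0.28 points lower at the median than those who stayed. In relative terms, that is roughly 34% and 41% lower, respectively.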
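The gap between the two groups can be read either as an absolute difference on the 0–1 scale or as a relative difference against the stayers' score. The short calculation below works through both using the means and medians from the table; the function name `satisfaction_gap` is only for illustration.

```python
def satisfaction_gap(stayed, left):
    """Return (absolute gap in scale points, gap relative to stayers in percent)."""
    absolute = stayed - left
    relative_pct = 100 * absolute / stayed
    return round(absolute, 3), round(relative_pct, 1)

mean_gap = satisfaction_gap(0.667, 0.440)    # means from the table above
median_gap = satisfaction_gap(0.690, 0.410)  # medians from the table above
```

The absolute gaps are 0.227 and 0.28 scale points; relative to stayers, they are about 34% and 41%.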

Below: Working long hours does not guarantee a high evaluation, nor does a strong evaluation ensure a reasonable workload. Among leavers, the percentage of highly evaluated employees working long hours closely mirrors the pattern seen in satisfaction levels. Many top performers are being driven away.

Tenure

Attrition climbs through years 3 and 4 and peaks at year 5, when about 45% of employees leave. It falls to about 20% at year 6, and no one with 7 or more years of tenure left.

Tenure Left Count Percent
2 Stayed 2879 98.93
2 Left 31 1.07
3 Stayed 4316 83.16
3 Left 874 16.84
4 Stayed 1510 75.31
4 Left 495 24.69
5 Stayed 580 54.61
5 Left 482 45.39
6 Stayed 433 79.89
6 Left 109 20.11
7 Stayed 94 100.00
7 Left 0 0.00
8 Stayed 81 100.00
8 Left 0 0.00
10 Stayed 107 100.00
10 Left 0 0.00
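Breakdowns like the tenure table above, with counts and row-wise percentages of leavers and stayers per feature value, can be built with `pd.crosstab`. The sketch below assumes a cleaned frame with a binary `left` column; the helper name `attrition_table` and the tiny sample frame are hypothetical.

```python
import pandas as pd

def attrition_table(df, feature, target="left"):
    """Return (counts, row percentages) of the target for each feature value."""
    counts = pd.crosstab(df[feature], df[target])
    # normalize="index" makes each row sum to 1, i.e. shares within a feature value.
    percents = pd.crosstab(df[feature], df[target], normalize="index").mul(100).round(2)
    return counts, percents

# Tiny illustrative frame (not the real data): 0 = stayed, 1 = left.
sample = pd.DataFrame({
    "tenure": [2, 2, 2, 2, 5, 5],
    "left":   [0, 0, 0, 1, 1, 0],
})
counts, percents = attrition_table(sample, "tenure")
```

The same helper applies unchanged to other categorical or discrete features, such as number of projects, salary band, or department.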

A band of employees with low satisfaction is especially evident at four years of tenure.

There is a clear grouping of leavers who consistently worked long hours (i.e., many in excess of a 60-hour work week). In fact, most employees at this company work above a standard 40-hour work week.

Number of Projects

The number of projects is a strong predictor of attrition. Employees at both the low and high extremes are more likely to leave, and notably, all employees with 7 projects left the company.

Number of Projects Left Count Percent
2 Left 857 54.17
2 Stayed 725 45.83
3 Left 38 1.08
3 Stayed 3482 98.92
4 Left 237 6.43
4 Stayed 3448 93.57
5 Left 343 15.36
5 Stayed 1890 84.64
6 Left 371 44.92
6 Stayed 455 55.08
7 Left 145 100.00
7 Stayed 0 0.00

There are no notable outliers among employees who stayed. Among those who left, both overworked and underworked patterns are evident, along with a group who appear to have left for more typical reasons. For leavers with many projects (6 or 7), the IQR for monthly hours was very high, entirely above 240 hours/month. In fact, the IQR for almost every group was above a typical 40-hour work week (~167 hours / month). Interestingly, a few employees with 7 projects reported unusually low monthly hours, which may reflect data anomalies or unique circumstances.
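The ~167 hours/month benchmark is consistent with 40 hours × 50 working weeks ÷ 12 months. The 50-week convention is an assumption here, since the text does not state how the figure was derived. The sketch below converts in both directions so monthly totals can be read as weekly workloads.

```python
MONTHS_PER_YEAR = 12

def weekly_to_monthly(hours_per_week, working_weeks=50):
    """Convert weekly hours to an average monthly total (assumes 50 working weeks)."""
    return hours_per_week * working_weeks / MONTHS_PER_YEAR

def monthly_to_weekly(hours_per_month, working_weeks=50):
    """Convert an average monthly total back to weekly hours."""
    return hours_per_month * MONTHS_PER_YEAR / working_weeks

benchmark = weekly_to_monthly(40)        # standard 40-hour week
heavy_week = monthly_to_weekly(240)      # lower edge of the 6-7 project leavers' IQR
max_week = monthly_to_weekly(310)        # maximum monthly hours in the dataset
```

Under this convention, 240 hours/month is about 58 hours per week and the dataset maximum of 310 hours/month is about 74 hours per week.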

Among employees who left, dissatisfaction is most evident for those assigned a very high number of projects. Conversely, those with fewer projects also show signs of lower satisfaction, possibly indicating disengagement.

There are no clear patterns linking number of projects, salary, and attrition. However, the relatively small group of high-salaried employees tends to fall in the middle range for number of projects.

Salary & Promotions

Attrition falls as salary band rises (20.45% for low, 14.62% for medium, 4.85% for high), but the effect is small next to satisfaction and workload. The 'high' salary group is also roughly five to six times smaller than the other two, limiting its impact on overall trends.

Promotions were rare and, notably, all of the employees with the highest workload left.

Salary Left Count Percent
High Left 48 4.85
High Stayed 942 95.15
Low Left 1174 20.45
Low Stayed 4566 79.55
Medium Left 769 14.62
Medium Stayed 4492 85.38

Promotion Last 5 Years Left Count Percent
No Left 1983 16.82
No Stayed 9805 83.18
Yes Left 8 3.94
Yes Stayed 195 96.06
No Total 11788 98.31
Yes Total 203 1.69

Work Accident

Somewhat unexpectedly, having a work accident is associated with a lower likelihood of leaving. This could suggest that employees who experience an accident may receive increased support or attention from HR or the company, which encourages them to stay. However, this association could also be coincidental.

Work Accident Left Count Percent
No Left 1886 18.60
No Stayed 8255 81.40
Yes Left 105 5.68
Yes Stayed 1745 94.32

Department

Department-level attrition closely matches the overall stay/leave split (83%/17%) and company-wide satisfaction levels, suggesting department itself is not a major factor. More granular data (e.g., by manager or team) might uncover specific problem areas, but nothing stands out in the current breakdown.

Department Left Count Percent
IT Left 158 16.19
IT Stayed 818 83.81
R and D Left 85 12.25
R and D Stayed 609 87.75
Accounting Left 109 17.55
Accounting Stayed 512 82.45
HR Left 113 18.80
HR Stayed 488 81.20
Management Left 52 11.93
Management Stayed 384 88.07
Marketing Left 112 16.64
Marketing Stayed 561 83.36
Product Mng Left 110 16.03
Product Mng Stayed 576 83.97
Sales Left 550 16.98
Sales Stayed 2689 83.02
Support Left 312 17.13
Support Stayed 1509 82.87
Technical Left 390 17.38
Technical Stayed 1854 82.62

Correlation Matrix

The correlation matrix shows moderate correlation between some variables. Employee attrition (leaving) is most strongly and negatively correlated with satisfaction level, indicating that less satisfied employees are more likely to leave. There are moderate positive correlations between average monthly hours, last evaluation, and number of projects, as well as a moderate association between tenure and attrition.
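A correlation ranking of this kind can be sketched as follows. The example assumes Pearson correlation over the numeric columns only and uses a hypothetical helper, `correlations_with_target`, on a small made-up frame; it is not the notebook's actual code or data.

```python
import pandas as pd

def correlations_with_target(df, target="left"):
    """Return Pearson correlations of each numeric feature with the target, ascending."""
    numeric = df.select_dtypes(include="number")  # skips text columns such as salary
    corr = numeric.corr()
    return corr[target].drop(target).sort_values()

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "satisfaction_level": [0.90, 0.80, 0.20, 0.10, 0.85, 0.15],
    "average_monthly_hours": [150, 260, 155, 270, 200, 210],
    "salary": ["low", "medium", "low", "high", "medium", "low"],
    "left": [0, 0, 1, 1, 0, 1],
})
ranked = correlations_with_target(sample)
```

Because the result is sorted in ascending order, the most negatively correlated feature appears first.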

Insights

The data suggests significant challenges with employee retention at this company. Two main groups of leavers emerge:

  • Underutilized and Dissatisfied: Employees in this category worked on fewer projects and logged fewer hours than a typical full-time schedule, and reported lower satisfaction. These individuals may have been disengaged, assigned less work as they prepared to leave, or potentially subject to layoffs or terminations.
  • Overworked and Burned Out: The second group managed a high number of projects (up to 7) and worked exceptionally long hours, sometimes exceeding 70 hours per week (the maximum in the data is 310 hours per month). These employees exhibited very low satisfaction and rarely received promotions, suggesting that high demands without recognition or advancement led to burnout and resignation.

A majority of the workforce greatly exceeds the typical 40-hour work week (160–184 hours per month), pointing to a workplace culture that expects long hours. The combination of high workload and limited opportunities for advancement likely fuels dissatisfaction and increases the risk of turnover.

Performance evaluations show only a weak link to attrition; both those who left and those who stayed received similar review scores. This indicates that strong performance alone does not guarantee retention, especially if employees are overworked or lack opportunities for growth.

Other variables, such as department, salary, and work accidents, do not show strong predictive value for employee churn compared to satisfaction and workload. Overall, the data points to issues with workload management and limited career progression as the main factors driving employee turnover at this company.

© 2025 Bryan C. Johns