The hypothetical HR department at the fictional Salifort Motors collected employee data to improve satisfaction. They requested data-driven suggestions based on an analysis of this data. The main question is: what factors are likely to make an employee leave the company?
The goal of this project is to analyze the data and build a model to predict employee attrition. By identifying which employees are likely to leave, it may be possible to determine the factors contributing to their departure. The model should be interpretable so HR can design targeted interventions to improve retention. Improving retention can reduce the costs associated with hiring and training new employees.
Stakeholders:
The primary stakeholder is the Human Resources (HR) department, as they will use the results to inform retention strategies. Secondary stakeholders include C-suite executives who oversee company direction, managers implementing day-to-day retention efforts, employees (whose experiences and outcomes are directly affected), and, indirectly, customers, since employee satisfaction can impact customer satisfaction.
Ethical Considerations:
The dataset contains 14,999 rows and 10 columns for the variables listed below.
Note: For more information about the data, refer to its source on Kaggle.
Variable | Description
---|---
satisfaction_level | Employee-reported job satisfaction level [0–1]
last_evaluation | Score of employee's last performance review [0–1]
number_project | Number of projects employee contributes to
average_monthly_hours | Average number of hours employee worked per month
time_spend_company | How long the employee has been with the company (years)
Work_accident | Whether or not the employee experienced an accident while at work
left | Whether or not the employee left the company
promotion_last_5years | Whether or not the employee was promoted in the last 5 years
Department | The employee's department
salary | The employee's salary band (low, medium, or high)
Initial Data Observations:
Data Wrangling
During initial data exploration, several basic data cleaning steps were taken. Columns were renamed to a standardized snake_case format for consistency and easier coding. No missing values were found, which reduces the risk of bias or errors. Outliers were explored but not removed at this stage; they will be addressed as needed during modeling.
Most importantly, there were 3,008 duplicate rows in the dataset. Since it is highly improbable for two employees to have identical responses across all columns, these duplicate entries were removed from the analysis.
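The cleaning steps described above can be sketched in pandas roughly as follows. This is a minimal illustration on a toy frame, not the project's actual code: the values are made up, and the rename targets (for example `time_spend_company` to `tenure`) are assumptions based on the names used later in this report.

```python
import pandas as pd

# Toy frame standing in for the raw data (illustrative values only).
raw = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.38, 0.11],
    "average_montly_hours": [157, 262, 157, 272],
    "time_spend_company": [3, 6, 3, 4],
    "Work_accident": [0, 0, 0, 0],
    "Department": ["sales", "sales", "sales", "sales"],
    "left": [1, 1, 1, 1],
})

# Hypothetical rename map: fix the misspelled column and move to snake_case.
rename_map = {
    "average_montly_hours": "average_monthly_hours",
    "time_spend_company": "tenure",
    "Work_accident": "work_accident",
    "Department": "department",
}
df = raw.rename(columns=rename_map)

# Count missing values per column.
missing_per_column = df.isna().sum()

# Count rows that exactly repeat an earlier row, then drop them.
n_duplicates = int(df.duplicated().sum())
df_clean = df.drop_duplicates(keep="first").reset_index(drop=True)
```

Here `duplicated()` flags only the second and later copies of a row, so `n_duplicates` is the number of rows that `drop_duplicates` removes.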
Start with a look at the first few rows of data:
index | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | Department | salary |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |
Below are descriptive statistics for numerical data, followed by categorical data (department and salary) and summary observations.
statistic | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years |
---|---|---|---|---|---|---|---|---|
count | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 |
mean | 0.612834 | 0.716102 | 3.803054 | 201.050337 | 3.498233 | 0.144610 | 0.238083 | 0.021268 |
std | 0.248631 | 0.171169 | 1.232592 | 49.943099 | 1.460136 | 0.351719 | 0.425924 | 0.144281 |
min | 0.090000 | 0.360000 | 2.000000 | 96.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.440000 | 0.560000 | 3.000000 | 156.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 0.640000 | 0.720000 | 4.000000 | 200.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 0.820000 | 0.870000 | 5.000000 | 245.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 |
max | 1.000000 | 1.000000 | 7.000000 | 310.000000 | 10.000000 | 1.000000 | 1.000000 | 1.000000 |
Department value counts and percentages:
Department | Count | Percent |
---|---|---|
Sales | 4140 | 27.60 |
Technical | 2720 | 18.13 |
Support | 2229 | 14.86 |
IT | 1227 | 8.18 |
Product Mng | 902 | 6.01 |
Marketing | 858 | 5.72 |
R and D | 787 | 5.25 |
Accounting | 767 | 5.11 |
HR | 739 | 4.93 |
Management | 630 | 4.20 |
Salary value counts and percentages:
salary | Count | Percent |
---|---|---|
Low | 7316 | 48.78 |
Medium | 6446 | 42.98 |
High | 1237 | 8.25 |
Summary:
The data shows a workforce with moderate satisfaction (mean 0.61), generally high performance reviews (mean 0.72), and a typical tenure of 3–4 years. Only about 2% of employees were promoted in the last five years, and about 14% experienced a workplace accident. Most employees are in the low or medium salary bands and concentrated in sales, technical, and support roles. About 24% of records in the raw data (before duplicates are removed) belong to employees who left. There are no extreme outliers, but a few employees have unusually long tenures or high monthly hours.
Long-term employees (those with more than five years at the company) are statistical outliers in this dataset. These cases will be excluded from logistic regression models during Model Construction & Validation because logistic regression is particularly sensitive to outliers.
Employees with exceptionally high average monthly hours or an unusually high or low number of projects may also be considered outliers. However, these cases are not easily detected in overall summary statistics or aggregate plots, as their impact is masked by the larger population of typical employees. Aggregate statistics can hide important subgroup patterns, which will be explored further during EDA. These subgroups will remain in the data for model construction, as they may provide valuable insights into attrition risk.
Number of tenure outliers: 824
Outliers percentage of total: 6.87%
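The report does not state which rule was used to flag tenure outliers. One common convention is the 1.5 × IQR rule, which would be consistent with the more-than-five-years cutoff given the tenure quartiles in the summary table (25% = 3, 75% = 4, so the upper fence is 5.5). The sketch below assumes that rule; the helper name `iqr_outlier_mask` and the toy values are illustrative.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Toy tenure values (illustrative only, not the real data).
tenure = pd.Series([2, 3, 3, 3, 3, 4, 4, 4, 5, 10], name="tenure")

mask = iqr_outlier_mask(tenure)
n_outliers = int(mask.sum())
pct_outliers = 100 * n_outliers / len(tenure)

# Rows to keep if outliers are excluded for an outlier-sensitive model.
tenure_no_outliers = tenure[~mask]
```

Keeping the mask separate from the data makes it easy to exclude these rows only for the models that need it while leaving them in place elsewhere.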
Employees working on 3–4 projects generally stayed. Most groups worked more than a typical 40-hour workweek.
Attrition rises with tenure up to the five-year mark, peaking at about 45% of employees with five years at the company, then falls to about 20% at six years; no employee with seven or more years left. This suggests a critical window for retention efforts in years four and five.
Both leavers and stayers tend to have similar evaluation scores. Notably, some employees with high evaluations still leave, often those who are overworked. This suggests that strong performance alone does not guarantee retention if other factors (like satisfaction or workload) are problematic.
Relationships Between Variables:
Distributions in the Data:
Ethical Considerations:
Note:
This data is clearly synthetic. It's too clean, and the clusters in the charts are much neater than what you’d see in real-world HR data.
Before starting, look at the distribution of employees who left versus those who stayed. An attrition rate of roughly 17% is not unusual for industry, but the resulting class imbalance matters when training a predictive model.
Status | Count | Percent |
---|---|---|
Stayed | 10000 | 83.40 |
Left | 1991 | 16.60 |
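A count-and-percent table like the one above can be produced with `value_counts`. The sketch below uses a six-row toy series rather than the real data, and also computes the accuracy of a trivial baseline that always predicts "stayed", which is one way to see why the imbalance matters. The variable names are illustrative.

```python
import pandas as pd

# Toy target column (illustrative only): 1 = left, 0 = stayed.
left = pd.Series([0, 0, 0, 0, 0, 1], name="left")

# Counts per class with readable labels.
counts = left.map({0: "Stayed", 1: "Left"}).value_counts()

summary = pd.DataFrame({
    "Count": counts,
    "Percent": (100 * counts / counts.sum()).round(2),
})

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = counts.max() / counts.sum()
```

With an 83/17 split, such a baseline would already score about 83% accuracy, so metrics such as recall, precision, or AUC are usually more informative than accuracy alone.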
These are overview plots that provide a broad look at the data. After these, we’ll focus on individual features in more detail. The goal here is to give an initial sense of the dataset’s structure and key patterns.
The following pairplots show the relationships between features, with the diagonal displaying each feature’s distribution.
Boxplots summarize the overall distribution of each feature. However, as noted earlier, aggregate plots can sometimes hide important subgroups or outliers.
Violin plots are especially useful here, as they reveal the presence of distinct subgroups. For example, in `satisfaction_level`, you can see the extremely miserable and somewhat dissatisfied employees, along with those who left for more typical reasons. In `last_evaluation` and `average_monthly_hours`, employees who left cluster at both extremes, while those who stayed are more evenly distributed. For `number_project`, leavers are concentrated at both the low and high ends, and for `tenure`, there is a noticeable spike in departures around the 4–5 year mark.
Finally, we include histograms for each feature, first normalized, to compare proportions between leavers and stayers...
...and then as raw counts:
There are two prominent clusters among employees who left: one group with very low satisfaction who worked long hours, and another group who worked fewer than 40 hours per week and reported moderate dissatisfaction. All of the employees who worked the longest hours left.
The pattern is similar among employees who left: those with very low satisfaction often received high evaluations, while those with moderate dissatisfaction tended to have relatively low evaluation scores.
Status | Mean | Median |
---|---|---|
Stayed | 0.667 | 0.690 |
Left | 0.440 | 0.410 |
Above: On the 0–1 satisfaction scale, employees who left scored 0.227 points lower on average (mean) and 0.28 points lower at the median than those who stayed.
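The gap between the two groups reads differently depending on whether it is expressed as an absolute difference on the 0–1 scale or relative to the stayers' score. The short calculation below takes the four values from the table above and computes both; the variable names are illustrative.

```python
# Satisfaction values copied from the table above.
stayed = {"mean": 0.667, "median": 0.690}
left = {"mean": 0.440, "median": 0.410}

gaps = {}
for stat in ("mean", "median"):
    absolute = stayed[stat] - left[stat]   # difference on the 0-1 scale
    relative = absolute / stayed[stat]     # share of the stayers' score
    gaps[stat] = {"absolute": round(absolute, 3), "relative": round(relative, 3)}
```

In relative terms, leavers' satisfaction was about 34% lower at the mean and about 41% lower at the median.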
Below: Working long hours does not guarantee a high evaluation, nor does a strong evaluation ensure a reasonable workload. Among leavers, the percentage of highly evaluated employees working long hours closely mirrors the pattern seen in satisfaction levels. Many top performers are being driven away.
Tenure | Left | Count | Percent |
---|---|---|---|
2 | Stayed | 2879 | 98.93 |
2 | Left | 31 | 1.07 |
3 | Stayed | 4316 | 83.16 |
3 | Left | 874 | 16.84 |
4 | Stayed | 1510 | 75.31 |
4 | Left | 495 | 24.69 |
5 | Stayed | 580 | 54.61 |
5 | Left | 482 | 45.39 |
6 | Stayed | 433 | 79.89 |
6 | Left | 109 | 20.11 |
7 | Stayed | 94 | 100.00 |
7 | Left | 0 | 0.00 |
8 | Stayed | 81 | 100.00 |
8 | Left | 0 | 0.00 |
10 | Stayed | 107 | 100.00 |
10 | Left | 0 | 0.00 |
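Tables of this shape (a count and a within-group percentage for each value of a feature) can be built with `pd.crosstab`. The sketch below runs on a ten-row toy frame, not the real data, and assumes the renamed `tenure` column and a 0/1 `left` column.

```python
import pandas as pd

# Toy data (illustrative only): 1 = left, 0 = stayed.
df = pd.DataFrame({
    "tenure": [2, 2, 2, 2, 3, 3, 3, 5, 5, 7],
    "left":   [0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
})

# Raw counts per tenure value and outcome.
counts = pd.crosstab(df["tenure"], df["left"])

# Row-normalized percentages: each tenure value sums to 100.
percent = (
    pd.crosstab(df["tenure"], df["left"], normalize="index")
    .mul(100)
    .round(2)
)
```

Normalizing by row (`normalize="index"`) gives the attrition rate within each tenure group. The same pattern applies to the other breakdowns in this report, such as number of projects, salary, or department.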
A band of employees with low satisfaction is especially evident at four years of tenure.
There is a clear grouping of leavers who consistently worked long hours (i.e., many in excess of a 60-hour work week). In fact, most employees at this company work above a standard 40-hour work week.
The number of projects is a strong predictor of attrition. Employees at both the low and high extremes are more likely to leave, and notably, all employees with 7 projects left the company.
Number of Projects | Left | Count | Percent |
---|---|---|---|
2 | Left | 857 | 54.17 |
2 | Stayed | 725 | 45.83 |
3 | Left | 38 | 1.08 |
3 | Stayed | 3482 | 98.92 |
4 | Left | 237 | 6.43 |
4 | Stayed | 3448 | 93.57 |
5 | Left | 343 | 15.36 |
5 | Stayed | 1890 | 84.64 |
6 | Left | 371 | 44.92 |
6 | Stayed | 455 | 55.08 |
7 | Left | 145 | 100.00 |
7 | Stayed | 0 | 0.00 |
There are no notable outliers among employees who stayed. Among those who left, both overworked and underworked patterns are evident, along with a group who appear to have left for more typical reasons. For leavers with many projects (6 or 7), the IQR for monthly hours was very high, entirely above 240 hours/month. In fact, the IQR for almost every group was above a typical 40-hour work week (~167 hours / month). Interestingly, a few employees with 7 projects reported unusually low monthly hours, which may reflect data anomalies or unique circumstances.
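The monthly figure that corresponds to a 40-hour week depends on the conversion used, and the report does not say which one was applied. The sketch below shows two candidate conventions as assumptions: 52 paid weeks per year, and 50 working weeks per year, which gives roughly 167 hours/month. The function names are illustrative.

```python
def weekly_to_monthly(weekly_hours: float, weeks_per_year: float = 52) -> float:
    """Convert hours per week to average hours per month."""
    return weekly_hours * weeks_per_year / 12

def monthly_to_weekly(monthly_hours: float, weeks_per_year: float = 52) -> float:
    """Convert average hours per month to hours per week."""
    return monthly_hours * 12 / weeks_per_year

full_year = weekly_to_monthly(40)                      # 52-week convention
with_leave = weekly_to_monthly(40, weeks_per_year=50)  # 50-week convention
threshold_week = monthly_to_weekly(240)                # 240 hours/month as weekly hours
```

Under the 52-week convention, 240 hours/month corresponds to a little over 55 hours per week.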
Among employees who left, dissatisfaction is most evident for those assigned a very high number of projects. Conversely, those with fewer projects also show signs of lower satisfaction, possibly indicating disengagement.
There are no clear patterns linking number of projects, salary, and attrition. However, the relatively small group of high-salaried employees tends to fall in the middle range for number of projects.
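Salary shows a weaker relationship with attrition than satisfaction or workload. Attrition falls from about 20% in the low band to about 15% in the medium band and about 5% in the high band, but the 'high' salary group is roughly five to six times smaller than the others, limiting its impact on overall trends.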
Promotions were rare and, notably, all of the employees with the highest workload left.
Salary | Left | Count | Percent |
---|---|---|---|
High | Left | 48 | 4.85 |
High | Stayed | 942 | 95.15 |
Low | Left | 1174 | 20.45 |
Low | Stayed | 4566 | 79.55 |
Medium | Left | 769 | 14.62 |
Medium | Stayed | 4492 | 85.38 |
Promotion Last 5 Years | Left | Count | Percent |
---|---|---|---|
No | Left | 1983 | 16.82 |
No | Stayed | 9805 | 83.18 |
Yes | Left | 8 | 3.94 |
Yes | Stayed | 195 | 96.06 |
No | Total | 11788 | 98.31 |
Yes | Total | 203 | 1.69 |
Somewhat unexpectedly, having a work accident is associated with a lower likelihood of leaving. This could suggest that employees who experience an accident may receive increased support or attention from HR or the company, which encourages them to stay. However, this association could also be coincidental.
Work Accident | Left | Count | Percent |
---|---|---|---|
No | Left | 1886 | 18.60 |
No | Stayed | 8255 | 81.40 |
Yes | Left | 105 | 5.68 |
Yes | Stayed | 1745 | 94.32 |
Department-level attrition stays close to the overall stay/leave split (83%/17%), ranging from about 12% (Management, R and D) to about 19% (HR), and department-level satisfaction is similar to the company-wide level, suggesting department itself is not a major factor. More granular data (e.g., by manager or team) might uncover specific problem areas, but nothing stands out in the current breakdown.
Department | Left | Count | Percent |
---|---|---|---|
IT | Left | 158 | 16.19 |
IT | Stayed | 818 | 83.81 |
R and D | Left | 85 | 12.25 |
R and D | Stayed | 609 | 87.75 |
Accounting | Left | 109 | 17.55 |
Accounting | Stayed | 512 | 82.45 |
HR | Left | 113 | 18.80 |
HR | Stayed | 488 | 81.20 |
Management | Left | 52 | 11.93 |
Management | Stayed | 384 | 88.07 |
Marketing | Left | 112 | 16.64 |
Marketing | Stayed | 561 | 83.36 |
Product Mng | Left | 110 | 16.03 |
Product Mng | Stayed | 576 | 83.97 |
Sales | Left | 550 | 16.98 |
Sales | Stayed | 2689 | 83.02 |
Support | Left | 312 | 17.13 |
Support | Stayed | 1509 | 82.87 |
Technical | Left | 390 | 17.38 |
Technical | Stayed | 1854 | 82.62 |
The correlation matrix shows moderate correlation between some variables. Employee attrition (leaving) is most strongly and negatively correlated with satisfaction level, indicating that less satisfied employees are more likely to leave. There are moderate positive correlations between average monthly hours, last evaluation, and number of projects, as well as a moderate association between tenure and attrition.
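A correlation matrix of this kind can be computed with `DataFrame.corr`. The sketch below uses a six-row toy frame with made-up values chosen so that leavers have low satisfaction and long hours; it is not the real data, and the column names assume the renamed snake_case columns.

```python
import pandas as pd

# Toy numeric data (illustrative only): 1 = left, 0 = stayed.
df = pd.DataFrame({
    "satisfaction_level": [0.9, 0.8, 0.7, 0.6, 0.2, 0.1],
    "average_monthly_hours": [160, 170, 180, 200, 280, 300],
    "number_project": [3, 3, 4, 4, 6, 7],
    "left": [0, 0, 0, 0, 1, 1],
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr(method="pearson")

# Correlation of each feature with the target, most negative first.
left_corr = corr["left"].drop("left").sort_values()
```

Pearson correlation with a binary column such as `left` is the point-biserial correlation. It captures only linear association, so U-shaped patterns like the one seen for number of projects can be understated.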
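The data suggests significant challenges with employee retention at this company. Two main groups of leavers emerge: overworked employees with very low satisfaction, long hours, and many projects, who often have high evaluations; and underworked employees with moderate dissatisfaction, fewer hours, few projects, and relatively low evaluations.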
A majority of the workforce greatly exceeds the typical 40-hour work week (160–184 hours per month), pointing to a workplace culture that expects long hours. The combination of high workload and limited opportunities for advancement likely fuels dissatisfaction and increases the risk of turnover.
Performance evaluations show only a weak link to attrition; both those who left and those who stayed received similar review scores. This indicates that strong performance alone does not guarantee retention, especially if employees are overworked or lack opportunities for growth.
Other variables, such as department, salary, and work accidents, do not show strong predictive value for employee churn compared to satisfaction and workload. Overall, the data points to issues with workload management and limited career progression as the main factors driving employee turnover at this company.