Introduction

Project Goal

The primary goal of this project is to develop a predictive maintenance model for Swire Coca-Cola’s production plants that accurately forecasts machine downtimes. This model will help the company identify the reasons behind machine breakdowns, optimize repair processes, and ultimately reduce the financial losses associated with unplanned downtimes.

Business and Analytic Problems

Swire Coca-Cola currently faces significant challenges due to the mechanical limitations of its production plants. Key issues include:

Unplanned Downtimes: Approximately 5.6% of production cases are lost due to unforeseen mechanical issues, resulting in annual losses of about $60 million.
Inadequate Predictive Insights: The existing approach relies on reactive measures after a breakdown occurs, making it difficult to anticipate issues before they impact production.
Data Utilization: There is a need to analyze the downtime records stored in the Internal Warehouse Controller (IWC) system, which only captures severe breakdowns, limiting the insights available for predicting future issues.

Purpose of the Notebook

The purpose of this notebook is to provide a detailed framework for the modeling process aimed at predicting machine downtimes at Swire Coca-Cola. This includes data preparation steps such as variable transformations, feature engineering, and discussion of missing values. Based on the open-ended nature of the business problem, we will explore the modeling task from a number of different angles.

Survival Analysis

The bulk of the notebook will be dedicated to a survival analysis model of time between machine failures. This analysis will explore the time-to-failure patterns of various equipment types across different functional areas in a production facility. The goal is to identify high-risk equipment types and functional areas that may benefit from preventive maintenance interventions. We will use the Kaplan-Meier estimator and the Cox Proportional Hazards model to estimate survival probabilities and examine covariate effects on equipment longevity. Specifically, the Kaplan-Meier estimator will provide insights into survival probability trends over time and estimate how long equipment lasts before failure. Additionally, we will create survival curves based on equipment type and equipment ID to investigate survival patterns across different categories. A Cox Proportional Hazards model will then be applied to examine how specific equipment types and plant locations influence failure risk.

Quarterly Forecasting

Next we will build and evaluate a predictive model aimed at forecasting the number of maintenance breakdowns that will occur in each quarter. By leveraging historical data on breakdowns and maintenance orders, and various operational factors, we will identify key features that drive breakdown trends and develop a model to predict future maintenance needs. Using Random Forest regression, we’ll train a model and assess its performance based on common metrics like Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared. We’ll also explore how features influence the predictions, including how past breakdowns, functional areas, and seasonal trends contribute to the accuracy of the forecast. Finally, we’ll refine the model with log transformations and polynomial features, and evaluate the results to identify areas for further improvement. The ultimate goal is to create a robust model that can be used for proactive maintenance planning, ensuring that resources are allocated efficiently to prevent equipment failures.

Work Order Duration Model

Finally, we will attempt to predict ACTUAL_WORK_IN_MINUTES, which represents the time required to complete maintenance work orders, using various input features. The goal is to develop a model that can forecast how long each work order will take, helping Swire predict and manage disruptions in production more effectively. Given the dataset’s size and the computational constraints, we will begin by testing simpler models to establish a baseline. The first model will be a linear regression, which will help us understand how well a straightforward approach performs with the given features. Next, we will use a random forest to evaluate if a more complex, non-linear model improves prediction accuracy. Finally, we will create a regression tree that incorporates the text data features derived in the earlier data preparation stages, testing its potential for capturing patterns in unstructured data. Each model will be evaluated using basic hold-out cross-validation by comparing metrics such as RMSE and MAE on both the training and test datasets.

Within the notebook, we will provide a summary of insights gained from the modeling process, highlight key features and discuss the impact of different techniques.

Data Preparation

We will start by loading the necessary libraries and reading the dataset from a csv file.

# libraries
library(tidyverse)
library(rpart)
library(rpart.plot)
library(skimr)
library(janitor)
library(corrplot)
library(lubridate)
library(survival)
library(survminer)
library(readr)
library(VIM)
library(zoo)
library(randomForest)
library(caret)  
library(knitr)
library(kableExtra)
library(gridExtra)
library(Metrics)
library(tm)
library(pls)
library(e1071)

# read data
data <- read_csv("IWC_Work_Orders_Extract.csv")

An important aspect of modeling is considering how to handle missing data. Before proceeding, let’s see the percent of missing values there are in each column.

# calculate the percentage of missing data for each column
missing_data <- data %>%
  summarise(across(everything(), ~ mean(is.na(.)) * 100)) %>%
  pivot_longer(cols = everything(), names_to = "Column", values_to = "Percent Missing") %>%
  arrange(desc(`Percent Missing`))

# Print the result
print(missing_data)

## # A tibble: 25 × 2
##    Column                          `Percent Missing`
##    <chr>                                       <dbl>
##  1 FUNCTIONAL_AREA_NODE_5_MODIFIED              99.8
##  2 EQUIPMENT_DESC                               94.4
##  3 EQUIP_CAT_DESC                               94.4
##  4 EQUIP_START_UP_DATE                          94.4
##  5 EQUIP_VALID_FROM                             94.4
##  6 EQUIP_VALID_TO                               94.4
##  7 MAINTENANCE_PLAN                             89.3
##  8 MAINTENANCE_ITEM                             89.3
##  9 FUNCTIONAL_AREA_NODE_4_MODIFIED              83.4
## 10 FUNCTIONAL_AREA_NODE_3_MODIFIED              81.9
## # ℹ 15 more rows

As we can see, 16 out of the 25 columns contain more than 75% NAs. This presents a difficult challenge to overcome. Some of the options we considered include dropping the columns, imputing with a default value, using algorithms that can handle missing data automatically, and segmenting the data. Our goal is for the models to remain robust and reliable so they can still be used effectively on new data. Ultimately, the data preparation decisions were based on the goals and limitations of each of the models.

Data Cleaning for Survival Analysis

To prepare the data for our survival analysis model, we (1) replace blank strings with NAs in the text columns, (2) filter out rows with missing values in the FUNCTIONAL_AREA_NODE_2_MODIFIED column as these are crucial for the analysis, (3) convert dates, times, and categorical variables to the proper format, (4) create an indicator variable for unplanned work orders, (5) calculate time to failure and time since installation, and (6) replace NA values with 0 then adjust time to failure to have a minimum of 1.

# data prep for survival analysis
data_sa <- read.csv("IWC_Work_Orders_Extract.csv") %>%
  # replace blank strings with NA in character columns only
  mutate(across(where(is.character), ~na_if(.x, ""))) %>%
  # drop rows with missing values in EQUIPMENT_ID and MAINTENANCE_ITEM
  filter(!is.na(FUNCTIONAL_AREA_NODE_2_MODIFIED)) %>%
  # convert date and time columns to proper type
  mutate(
    EXECUTION_START_DATE = as.Date(EXECUTION_START_DATE, format = "%Y-%m-%d"),
    EXECUTION_FINISH_DATE = as.Date(EXECUTION_FINISH_DATE, format = "%Y-%m-%d"),
    EQUIP_START_UP_DATE = as.Date(EQUIP_START_UP_DATE, format = "%Y-%m-%d"),
    EQUIP_VALID_FROM = as.Date(EQUIP_VALID_FROM, format = "%Y-%m-%d"),
    EQUIP_VALID_TO = as.Date(EQUIP_VALID_TO, format = "%Y-%m-%d"),
    ACTUAL_START_TIME = hms(ACTUAL_START_TIME),
    ACTUAL_FINISH_TIME = hms(ACTUAL_FINISH_TIME)
  ) %>%
  # convert non-numeric and non-date type variables to factor
  mutate(
    PLANT_ID = factor(PLANT_ID),
    PRODUCTION_LOCATION = factor(PRODUCTION_LOCATION),
    MAINTENANCE_ACTIVITY_TYPE = factor(MAINTENANCE_ACTIVITY_TYPE),
    MAINTENANCE_TYPE_DESCRIPTION = factor(MAINTENANCE_TYPE_DESCRIPTION),
    FUNCTIONAL_AREA_NODE_1_MODIFIED = factor(FUNCTIONAL_AREA_NODE_1_MODIFIED),
    FUNCTIONAL_AREA_NODE_2_MODIFIED = factor(FUNCTIONAL_AREA_NODE_2_MODIFIED),
    FUNCTIONAL_AREA_NODE_3_MODIFIED = factor(FUNCTIONAL_AREA_NODE_3_MODIFIED),
    FUNCTIONAL_AREA_NODE_4_MODIFIED = factor(FUNCTIONAL_AREA_NODE_4_MODIFIED),
    FUNCTIONAL_AREA_NODE_5_MODIFIED = factor(FUNCTIONAL_AREA_NODE_5_MODIFIED),
    EQUIP_CAT_DESC = factor(EQUIP_CAT_DESC),
    MAINTENANCE_PLAN = factor(MAINTENANCE_PLAN),
    EQUIPMENT_ID = factor(EQUIPMENT_ID),
    MAINTENANCE_ITEM = factor(MAINTENANCE_ITEM)
  ) %>%
  # create failure indicator (1 = failure, 0 = no failure)
  mutate(failure = ifelse(MAINTENANCE_ACTIVITY_TYPE == "Unplanned", 1, 0)) %>%
  # sort data by FUNCTIONAL_AREA_NODE_2_MODIFIED and EXECUTION_START_DATE
  arrange(FUNCTIONAL_AREA_NODE_2_MODIFIED, EXECUTION_START_DATE) %>%
  # calculate time_to_failure and time_since_installation
  group_by(FUNCTIONAL_AREA_NODE_2_MODIFIED) %>%
  mutate(
    time_to_failure = ifelse(failure == 1, as.numeric(difftime(EXECUTION_START_DATE, lag(EXECUTION_START_DATE), units = "days")), NA),
    time_since_installation = as.numeric(difftime(EXECUTION_START_DATE, EQUIP_VALID_FROM, units = "days"))
  ) %>%
  fill(time_to_failure, .direction = "down") %>%
  ungroup() %>%
  # replace NA values with 0, then adjust time_to_failure to have a minimum of 1
  mutate(
    time_to_failure = replace_na(time_to_failure, 0),
    time_to_failure = ifelse(time_to_failure == 0, 1, time_to_failure),
    time_since_installation = replace_na(time_since_installation, 0)
  )

Data Cleaning for Quarterly Predictive Model

To prepare the data for our quarterly predictive model, we extract the year and quater from the start date variable, group by quarter and functional area to calculate number of records, average repair duration, and proportion of unplanned orders, calculate lagged features and moving averages, replace missing values with 0 where appropriate, and one-hot encode categorical variables.

# load data
data = read.csv("IWC_Work_Orders_Extract.csv")

# Convert character dates to Date format
data$EXECUTION_START_DATE <- as.Date(data$EXECUTION_START_DATE)
data$EXECUTION_FINISH_DATE <- as.Date(data$EXECUTION_FINISH_DATE)

# Extract year and quarter from EXECUTION_START_DATE
data$execution_year <- year(data$EXECUTION_START_DATE)
data$execution_quarter <- quarter(data$EXECUTION_START_DATE)


# extract year and quarter
data <- data %>%
  mutate(
    execution_year = year(EXECUTION_START_DATE),
    execution_quarter = quarter(EXECUTION_START_DATE)
  )

# summarize data by quarter and functional area
data_model <- data %>%
  group_by(FUNCTIONAL_AREA_NODE_1_MODIFIED, execution_year, execution_quarter) %>%
  summarize(
    records = n(),
    avg_repair_duration = mean(ACTUAL_WORK_IN_MINUTES, na.rm = TRUE),
    unplanned_maintenance_prop = mean(MAINTENANCE_ACTIVITY_TYPE == "Unplanned", na.rm = TRUE)
  ) %>%
  ungroup()

# add lagged features and moving averages
data_model <- data_model %>%
  arrange(FUNCTIONAL_AREA_NODE_1_MODIFIED, execution_year, execution_quarter) %>%
  group_by(FUNCTIONAL_AREA_NODE_1_MODIFIED) %>%
  mutate(
    breakdowns_last_quarter = lag(records, 1),
    breakdowns_same_quarter_last_year = lag(records, 4),
    avg_breakdowns_last_4_quarters = rollmean(records, k = 4, fill = NA, align = "right"),
    total_breakdowns_last_4_quarters = rollsum(records, k = 4, fill = NA, align = "right")
  ) %>%
  ungroup()

# handle missing values
data_model <- data_model %>%
  mutate(
    breakdowns_last_quarter = ifelse(is.na(breakdowns_last_quarter), 0, breakdowns_last_quarter),
    breakdowns_same_quarter_last_year = ifelse(is.na(breakdowns_same_quarter_last_year), 0, breakdowns_same_quarter_last_year),
    avg_breakdowns_last_4_quarters = ifelse(is.na(avg_breakdowns_last_4_quarters), 0, avg_breakdowns_last_4_quarters),
    total_breakdowns_last_4_quarters = ifelse(is.na(total_breakdowns_last_4_quarters), 0, total_breakdowns_last_4_quarters)
  )

# one-hot encode categorical variables if necessary
data_model <- data_model %>%
  mutate(FUNCTIONAL_AREA_NODE_1_MODIFIED = as.factor(FUNCTIONAL_AREA_NODE_1_MODIFIED))

Data Cleaning for Work Order Time Model

Our final model includes the text data, so we have to do a bit of work to wrangle the data into a usable format. We filter out entries with missing ORDER_DESCRIPTION and remove columns that have a majority of missing values to ensure the dataset remains relevant and manageable. Missing values in specific columns are replaced with a default text value (“missing_value”) to maintain consistency. We then convert several categorical variables into factors for proper handling in modeling and combine various text variables into a single full_text column. After dropping the redundant columns, we create a text corpus from the full_text data, preprocessing the text by converting it to lowercase, removing punctuation, eliminating stopwords, and stripping whitespace. A Document-Term Matrix (DTM) is generated from this corpus, with sparse terms removed to focus on more meaningful text features. Finally, the DTM is combined with the cleaned dataset, resulting in a structured data frame ready for analysis and modeling.

# prep data for work time model
raw_data <- read_csv("IWC_Work_Orders_Extract.csv")

# filter out missing values
df <- raw_data %>%
  filter(!is.na(ORDER_DESCRIPTION))

# remove columns with majority missing
df <- df %>%
  select(-FUNCTIONAL_AREA_NODE_5_MODIFIED,
         -EQUIPMENT_DESC,
         -EQUIP_CAT_DESC,
         -EQUIP_START_UP_DATE,
         -EQUIP_VALID_FROM,
         -EQUIP_VALID_TO)

# replace NAs in specific columns with a default text value 
df <- df %>%
  mutate(across(c(MAINTENANCE_PLAN, MAINTENANCE_ITEM, FUNCTIONAL_LOC, FUNCTIONAL_AREA_NODE_1_MODIFIED, FUNCTIONAL_AREA_NODE_2_MODIFIED, FUNCTIONAL_AREA_NODE_3_MODIFIED, FUNCTIONAL_AREA_NODE_4_MODIFIED), ~ replace_na(., "missing_value"))
  )

# change categorical variables to factors, combine text variables
df <- df %>%
  mutate(MAINTENANCE_ACTIVITY_TYPE = as_factor(MAINTENANCE_ACTIVITY_TYPE)) %>%
  mutate(PLANT_ID = as_factor(PLANT_ID)) %>%
  mutate(PRODUCTION_LOCATION = as_factor(PRODUCTION_LOCATION)) %>%
  mutate(MAINTENANCE_TYPE_DESCRIPTION = as_factor(MAINTENANCE_TYPE_DESCRIPTION)) %>%
  mutate(EQUIPMENT_ID = as_factor(EQUIPMENT_ID)) %>%
  mutate(full_text = paste(MAINTENANCE_PLAN,
                           MAINTENANCE_ITEM,
                           ORDER_DESCRIPTION,
                           FUNCTIONAL_LOC,
                           FUNCTIONAL_AREA_NODE_1_MODIFIED,
                           FUNCTIONAL_AREA_NODE_2_MODIFIED,
                           FUNCTIONAL_AREA_NODE_3_MODIFIED,
                           FUNCTIONAL_AREA_NODE_4_MODIFIED))

# drop unnecessary columns, and ones we wouldn't have for new data
df <- df %>%
  select(-MAINTENANCE_PLAN,
         -MAINTENANCE_ITEM,
         -ORDER_DESCRIPTION,
         -FUNCTIONAL_LOC,
         -FUNCTIONAL_AREA_NODE_1_MODIFIED,
         -FUNCTIONAL_AREA_NODE_2_MODIFIED,
         -FUNCTIONAL_AREA_NODE_3_MODIFIED,
         -FUNCTIONAL_AREA_NODE_4_MODIFIED,
         -ORDER_ID,
         -EXECUTION_FINISH_DATE,
         -ACTUAL_FINISH_TIME)

# create a text corpus
corpus <- VCorpus(VectorSource(df$full_text))

# Preprocess the text
corpus <- tm_map(corpus, content_transformer(tolower))  # Convert to lowercase
corpus <- tm_map(corpus, removePunctuation)              # Remove punctuation
corpus <- tm_map(corpus, removeWords, stopwords("en"))  # Remove stopwords
corpus <- tm_map(corpus, stripWhitespace)                # Strip whitespace

# Create a Document-Term Matrix
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.99)  # Remove sparse terms
data_dtm <- as.data.frame(as.matrix(dtm))

# Combine the document-term matrix with existing data
final_data <- cbind(data_dtm, df) %>%
  select(-full_text) %>%
  rename_with(~ ifelse(grepl("^[0-9]", .), paste0("X", .), .))

Survival Analysis Model

Identify First Failures

# Identify the First Failure Event for Each FUNCTIONAL_AREA_NODE_2_MODIFIED
failure_data <- data_sa %>%
  filter(failure == 1) %>% 
  group_by(FUNCTIONAL_AREA_NODE_2_MODIFIED) %>% 
  summarise(
    Avg_time_to_failure = mean(time_to_failure, na.rm = TRUE),
    equipment_type = first(na.omit(EQUIPMENT_DESC)), # Selects first non-NA value
    plant_id = first(PLANT_ID),
    EQUIPMENT_ID = first(EQUIPMENT_ID),
    failure = 1
  )

head(failure_data)

## # A tibble: 6 × 6
##   FUNCTIONAL_AREA_NODE_2_MODIFIED Avg_time_to_failure equipment_type    plant_id
##   <fct>                                         <dbl> <chr>             <fct>   
## 1 AIR  SYSTEMS                                   5.95 SULLAIR 200 HP C… G221    
## 2 AIR SYSTEMS                                    1.36 TANK_STL_STOR_AIR G261    
## 3 AMMONIA & REFRIGERATION SYSTEMS                3.48 CHLLR_EVAP        G291    
## 4 BOTTLE LINE                                    1.00 L4 NITROGEN_DOSE… G816    
## 5 BTL_PET_LINE                                   1.00 L1 / L2 PALLETIZ… G291    
## 6 CAN LINE                                       1.00 L3 PACKER_PLASRI… G816    
## # ℹ 2 more variables: EQUIPMENT_ID <fct>, failure <dbl>

The output provides the first few rows of failure_data, which includes:

Avg_time_to_failure: This is the average time until failure for equipment within each FUNCTIONAL_AREA_NODE_2_MODIFIED. For example, equipment in the “AIR SYSTEMS” area has an average time to failure of 1.36 days.

Equipment Details: The selected equipment_type, plant_id, and EQUIPMENT_ID represent the first encountered failure within each area, giving insights into the equipment type and plant where failures are frequent.

This output indicates which functional areas and specific equipment types may need more frequent or preventive maintenance due to shorter average times to failure.

Identify Censored Equipments

Equipment without any failure events are considered censored.

# Calculate censored data for equipment with no failures
censored_data <- data_sa %>%
  group_by(FUNCTIONAL_AREA_NODE_2_MODIFIED) %>%
  filter((failure == 0)) %>% 
  summarise(
    Avg_time_to_failure = as.numeric(difftime(max(EXECUTION_START_DATE), first(EQUIP_VALID_FROM), units = "days")),
    equipment_type = first(na.omit(EQUIPMENT_DESC)),  # Selects first non-NA equipment type
    plant_id = first(PLANT_ID),
    EQUIPMENT_ID = first(EQUIPMENT_ID),
    failure = 0
  )
head(censored_data)

## # A tibble: 6 × 6
##   FUNCTIONAL_AREA_NODE_2_MODIFIED Avg_time_to_failure equipment_type    plant_id
##   <fct>                                         <dbl> <chr>             <fct>   
## 1 AIR  SYSTEMS                                     NA ALIQUOT BLOWMOLD… G221    
## 2 AIR SYSTEMS                                    2750 DRYER_AIR_REFRIG  G291    
## 3 AMMONIA & REFRIGERATION SYSTEMS                2749 COND_EVAP_TOWER   G291    
## 4 BOTTLE LINE                                      NA L2 LABELER_BOTTL… G816    
## 5 BTL_PET_LINE                                     NA LINE 1 DEPAL_ELE… G291    
## 6 CAN LINE                                         NA L3 PACKER_CASE_T… G816    
## # ℹ 2 more variables: EQUIPMENT_ID <fct>, failure <dbl>

In censored_data, some rows show Avg_time_to_failure as NA, which suggests either missing date information or that the duration calculation didn’t produce a valid result. These NA values may need investigation to confirm whether these equipment items truly have no time-to-failure information. For equipment with valid Avg_time_to_failure values (like 2750 days for “AIR SYSTEMS” equipment DRYER_AIR_REFRIG), the data gives a measure of the equipment’s operational duration without failure, useful for identifying potentially robust equipment.

Combine Failure and Censored Data

# Combine the datasets
survival_data <- bind_rows(failure_data, censored_data)

# Ensure no duplicates
survival_data <- survival_data %>% distinct(FUNCTIONAL_AREA_NODE_2_MODIFIED, .keep_all = TRUE)

The combination enables the survival analysis to consider both failure (event) and censored (non-failure) data, crucial for accurately modeling survival times in a maintenance context.

Create Survival Object

# Create survival object
surv_object <- Surv(time = survival_data$Avg_time_to_failure, event = survival_data$failure)

The surv_object is created using the Surv() function, with Avg_time_to_failure as the survival time and failure as the event indicator. This object is essential for fitting the survival model.

The survival object represents the core data structure for survival analysis, combining the time and event status for each functional area. This step prepares the data for Kaplan-Meier estimation and other survival analysis techniques.

Kaplan-Meier Estimator

# Fit Kaplan-Meier survival model
km_fit <- survfit(surv_object ~ 1)

# Plot the survival curve
ggsurvplot(km_fit, data = survival_data, conf.int = TRUE, pval = TRUE,
           title = "Kaplan-Meier Survival Curve",
           xlab = "Time to Failure (Days)", ylab = "Survival Probability")

## Warning in .pvalue(fit, data = data, method = method, pval = pval, pval.coord = pval.coord, : There are no survival curves to be compared. 
##  This is a null model.

The Kaplan-Meier curve is a visual indicator of equipment reliability over time, which reveals potentially high-risk periods where survival drops sharply and differences in survival probabilities between failure-prone areas and more robust equipment setups, which is useful for prioritizing maintenance resources effectively.

Interpretation of Kaplan-Meier Survival Curve

Survival Probability: The y-axis shows the probability of equipment surviving without failure over time (days on the x-axis). A steeper decline in the curve indicates faster equipment failure, while a more gradual slope suggests greater equipment longevity.

Confidence Intervals: Shaded areas around the curve represent confidence intervals, offering a range within which we expect the true survival probabilities to fall.

Survival Analysis by Equipment Type

# Ensure `equipment_type` is a factor and sort levels for consistency
survival_data <- survival_data %>%
  mutate(equipment_type = as.factor(equipment_type))

# Get unique equipment types and split into groups of 5-7
equipment_groups <- split(levels(survival_data$equipment_type), 
                          ceiling(seq_along(levels(survival_data$equipment_type)) / 7))

# Create a list to store plots
plot_list <- list()

# Loop through each group, fit survival curves, and create plots
for (i in seq_along(equipment_groups)) {
  
  # Filter survival data for the current group of equipment types
  group_data <- survival_data %>%
    filter(equipment_type %in% equipment_groups[[i]])
  
  # Create a survival object for the group
  surv_object_group <- Surv(time = group_data$Avg_time_to_failure, event = group_data$failure)
  
  # Fit Kaplan-Meier survival curves for the group
  km_fit_group <- survfit(surv_object_group ~ equipment_type, data = group_data)
  
  # Plot the survival curves for the group without p-value
  plot_list[[i]] <- ggsurvplot(
    km_fit_group, data = group_data, conf.int = TRUE, pval = FALSE,
    title = paste("Survival Curves for Equipment Types - Group", i),
    xlab = "Time to Failure (Days)", ylab = "Survival Probability",
    legend.title = "Equipment Type",
    legend = "right",
    ggtheme = theme_minimal(base_size = 10),
    font.legend = 8,
    palette = "Dark2"
  )
}

# Display all plots one by one
for (i in seq_along(plot_list)) {
  print(plot_list[[i]])
  Sys.sleep(1)  # Optional: pause for 1 second between plots for better readability
}

Survival Analysis by Equipment_IDs

# Filter the survival_data for relevant Equipment_IDs
filtered_data <- survival_data %>%
  filter(FUNCTIONAL_AREA_NODE_2_MODIFIED != "G812 SHOP / REPAIR AREA")

# Ensure `EQUIPMENT_ID` is a factor and sort levels for consistency
filtered_data <- filtered_data %>%
  mutate(EQUIPMENT_ID = as.factor(EQUIPMENT_ID))

# Define the number of Equipment_IDs per plot
num_ids_per_plot <- 5

# Create a list to store plots
plot_list_1 <- list()

# Split filtered data into chunks of 5 Equipment_IDs each
unique_ids <- unique(filtered_data$EQUIPMENT_ID)
id_chunks <- split(unique_ids, ceiling(seq_along(unique_ids) / num_ids_per_plot))

# Loop through each chunk, fit survival curves, and create plots
for (i in seq_along(id_chunks)) {
  
  # Filter survival data for the current chunk of Equipment_IDs
  group_data_1 <- filtered_data %>%
    filter(EQUIPMENT_ID %in% id_chunks[[i]])
  
  # Skip if group_data_1 is empty or has all missing Avg_time_to_failure or failure
  if (nrow(group_data_1) == 0 || all(is.na(group_data_1$Avg_time_to_failure)) || all(is.na(group_data_1$failure))) {
    next  # Skip this group and move to the next one
  }
  
  # Create a survival object for the group
  surv_object_group_1 <- Surv(time = group_data_1$Avg_time_to_failure, event = group_data_1$failure)
  
  # Fit Kaplan-Meier survival curves for the group
  km_fit_group_1 <- survfit(surv_object_group_1 ~ EQUIPMENT_ID, data = group_data_1)
  
  # Plot the survival curves for the group without p-value
  plot_list_1[[i]] <- ggsurvplot(
    km_fit_group_1, data = group_data_1, conf.int = TRUE, pval = FALSE,
    title = paste("Survival Curves for Equipment IDs - Group", i),
    xlab = "Time to Failure (Days)", ylab = "Survival Probability",
    legend.title = "Equipment IDs",
    legend = "right",
    ggtheme = theme_minimal(base_size = 10),
    font.legend = 8,
    palette = "Dark2"
  )
}

# Display all plots in plot_list_1, skipping any empty entries
for (i in seq_along(plot_list_1)) {
  if (!is.null(plot_list_1[[i]])) {
    print(plot_list_1[[i]])
  }
}

Cox Proportional Hazards Model

# Fit Cox Proportional Hazards model
cox_model <- coxph(surv_object ~ equipment_type + plant_id, data = survival_data)

## Warning in coxph.fit(X, Y, istrat, offset, init, control, weights = weights, :
## Ran out of iterations and did not converge

# Model summary
summary(cox_model)

## Call:
## coxph(formula = surv_object ~ equipment_type + plant_id, data = survival_data)
## 
##   n= 43, number of events= 43 
##    (24 observations deleted due to missingness)
## 
##                                                              coef   exp(coef)
## equipment_typeCIP #1 PUMP - P901                       -1.724e+02   1.364e-75
## equipment_typeCO2 ANALYZER MAX-BEV REMOTE TRUCK PANEL   5.209e+01   4.206e+22
## equipment_typeCODER_INKJET_LEIBINGER_JET3               3.533e+01   2.210e+15
## equipment_typeCODER-LINE 1 FULL CAN EAST               -1.778e+02   6.336e-78
## equipment_typeCOND_EVAP_NH3_ATC-1708E_WEST              9.201e+01   9.146e+39
## equipment_typeCONV_CBL_EMPTY_CAN                        2.212e+02   1.198e+96
## equipment_typeFILTERING_SYS_RO                          1.684e+02   1.377e+73
## equipment_typeGENIE SIZZOR LIFT GS-2632                 1.918e+01   2.138e+08
## equipment_typeHVAC_EXH_FAN_EF1                         -2.636e+02  3.315e-115
## equipment_typeINJ_SYS_NIT                              -7.423e+01   5.810e-33
## equipment_typeINTERNAL ROLL UP DOOR                     1.719e+02   4.704e+74
## equipment_typeL02 PALLET LABEL APPLICATOR               2.897e+02  6.263e+125
## equipment_typeL04 KRONES PALLETIZER                     2.143e+02   1.204e+93
## equipment_typeL1 / L2 PALLETIZER_ELECPNEU               3.347e+02  2.293e+145
## equipment_typeL1 DETECTOR_FILL_VISION_PRESSCO FHCP3X    3.930e+02  4.567e+170
## equipment_typeL1 EAST FILTEC                            1.895e+02   2.067e+82
## equipment_typeL1 PALLET LABELER                         2.320e+02  5.889e+100
## equipment_typeL2 ALLIANCE-CCF203 SPIRAL INCLINE         2.622e+02  7.587e+113
## equipment_typeL2 WULFTEC STRETCH WRAPPER                1.945e+02   2.818e+84
## equipment_typeL3 INTERLOX_ CASE_DIVERTER_S7050          3.247e+02  1.062e+141
## equipment_typeL3 PACKER_PLASRING_HICONE                 3.412e+02  1.577e+148
## equipment_typeL4 NITROGEN_DOSER_CRYODOSER 2K            3.147e+02  4.632e+136
## equipment_typeLAB EQUIP_VIDEO_SEAM                     -4.753e+01   2.279e-21
## equipment_typeLIFT_EQUIP_CARTON                        -2.982e+02  3.002e-130
## equipment_typeMICROBLEND MICRO2 BLENDER (LINE 1)       -1.288e+02   1.208e-56
## equipment_typeNIBTORQUE - AIR TORQUE WRENCH            -1.131e+02   7.554e-50
## equipment_typeNORTH NATURAL GAS COMPRESSOR             -1.123e+02   1.697e-49
## equipment_typePERMEATE TANK UV SYSTEM                  -1.450e+02   1.067e-63
## equipment_typePUMP_CENT_HOT_WATER_BUILDING              8.610e+01   2.464e+37
## equipment_typePUMP_DIA_CO2_CYLINDER_1                  -1.968e+02   3.540e-86
## equipment_typePUMP_HYD_CARDBOAR_COMPACTOR               2.479e+01   5.808e+10
## equipment_typeROGERS AIR COMPRESSOR (COMP #1)          -2.764e+02  9.412e-121
## equipment_typeSELLERS BOILER #2                        -3.438e+02  4.836e-150
## equipment_typeSUGAR UV SANITATION                      -2.606e+02  6.755e-114
## equipment_typeSULLAIR 200 HP COMP.                     -1.397e+02   2.211e-61
## equipment_typeSWTCHGR_PANEL_IO_MAIN_CONTROL            -3.164e+02  3.970e-138
## equipment_typeTANK_SS_STOR_PH_NEUTRALIZATION_#2_5000G   1.651e+01   1.487e+07
## equipment_typeTANK_SS_STOR_SANTIZING_2                 -2.052e+02   7.632e-90
## equipment_typeTANK_SS_STOR_TANK_3                       1.340e+02   1.523e+58
## equipment_typeTANK_STL_STOR_AIR                         1.201e+02   1.392e+52
## equipment_typeWEST SYRUP ROOM COMPUTERS                -5.782e+01   7.725e-26
## equipment_typeWRAPPER_STRETCH_TURNTBL_SM004036         -2.299e+02  1.384e-100
## plant_idG261                                            6.141e+00   4.647e+02
## plant_idG291                                           -2.492e+01   1.507e-11
## plant_idG811                                           -1.102e+01   1.639e-05
## plant_idG812                                            3.815e+01   3.707e+16
## plant_idG816                                            3.891e+01   7.952e+16
##                                                          se(coef)      z
## equipment_typeCIP #1 PUMP - P901                        2.547e+03 -0.068
## equipment_typeCO2 ANALYZER MAX-BEV REMOTE TRUCK PANEL   3.147e+03  0.017
## equipment_typeCODER_INKJET_LEIBINGER_JET3               3.028e+03  0.012
## equipment_typeCODER-LINE 1 FULL CAN EAST                2.585e+03 -0.069
## equipment_typeCOND_EVAP_NH3_ATC-1708E_WEST              3.450e+03  0.027
## equipment_typeCONV_CBL_EMPTY_CAN                        5.442e+03  0.041
## equipment_typeFILTERING_SYS_RO                          4.149e+03  0.041
## equipment_typeGENIE SIZZOR LIFT GS-2632                 2.837e+03  0.007
## equipment_typeHVAC_EXH_FAN_EF1                          3.514e+03 -0.075
## equipment_typeINJ_SYS_NIT                               2.563e+03 -0.029
## equipment_typeINTERNAL ROLL UP DOOR                     4.487e+03  0.038
## equipment_typeL02 PALLET LABEL APPLICATOR               1.374e+04  0.021
## equipment_typeL04 KRONES PALLETIZER                     6.140e+03  0.035
## equipment_typeL1 / L2 PALLETIZER_ELECPNEU               1.995e+04  0.017
## equipment_typeL1 DETECTOR_FILL_VISION_PRESSCO FHCP3X    1.301e+04  0.030
## equipment_typeL1 EAST FILTEC                            4.907e+03  0.039
## equipment_typeL1 PALLET LABELER                         1.044e+04  0.022
## equipment_typeL2 ALLIANCE-CCF203 SPIRAL INCLINE         8.423e+03  0.031
## equipment_typeL2 WULFTEC STRETCH WRAPPER                7.085e+03  0.027
## equipment_typeL3 INTERLOX_ CASE_DIVERTER_S7050          3.463e+04  0.009
## equipment_typeL3 PACKER_PLASRING_HICONE                 1.300e+04  0.026
## equipment_typeL4 NITROGEN_DOSER_CRYODOSER 2K            8.544e+04  0.004
## equipment_typeLAB EQUIP_VIDEO_SEAM                      2.643e+03 -0.018
## equipment_typeLIFT_EQUIP_CARTON                         5.099e+03 -0.058
## equipment_typeMICROBLEND MICRO2 BLENDER (LINE 1)        2.536e+03 -0.051
## equipment_typeNIBTORQUE - AIR TORQUE WRENCH             2.519e+03 -0.045
## equipment_typeNORTH NATURAL GAS COMPRESSOR              2.510e+03 -0.045
## equipment_typePERMEATE TANK UV SYSTEM                   2.523e+03 -0.057
## equipment_typePUMP_CENT_HOT_WATER_BUILDING              3.287e+03  0.026
## equipment_typePUMP_DIA_CO2_CYLINDER_1                   2.719e+03 -0.072
## equipment_typePUMP_HYD_CARDBOAR_COMPACTOR               2.925e+03  0.008
## equipment_typeROGERS AIR COMPRESSOR (COMP #1)           2.826e+03 -0.098
## equipment_typeSELLERS BOILER #2                         4.045e+03 -0.085
## equipment_typeSUGAR UV SANITATION                       3.191e+03 -0.082
## equipment_typeSULLAIR 200 HP COMP.                      2.511e+03 -0.056
## equipment_typeSWTCHGR_PANEL_IO_MAIN_CONTROL             8.695e+03 -0.036
## equipment_typeTANK_SS_STOR_PH_NEUTRALIZATION_#2_5000G   2.761e+03  0.006
## equipment_typeTANK_SS_STOR_SANTIZING_2                  2.641e+03 -0.078
## equipment_typeTANK_SS_STOR_TANK_3                       3.643e+03  0.037
## equipment_typeTANK_STL_STOR_AIR                         3.873e+03  0.031
## equipment_typeWEST SYRUP ROOM COMPUTERS                 2.598e+03 -0.022
## equipment_typeWRAPPER_STRETCH_TURNTBL_SM004036          2.976e+03 -0.077
## plant_idG261                                            1.445e+03  0.004
## plant_idG291                                            1.232e+03 -0.020
## plant_idG811                                            1.091e+03 -0.010
## plant_idG812                                            1.642e+03  0.023
## plant_idG816                                            1.286e+04  0.003
##                                                       Pr(>|z|)
## equipment_typeCIP #1 PUMP - P901                         0.946
## equipment_typeCO2 ANALYZER MAX-BEV REMOTE TRUCK PANEL    0.987
## equipment_typeCODER_INKJET_LEIBINGER_JET3                0.991
## equipment_typeCODER-LINE 1 FULL CAN EAST                 0.945
## equipment_typeCOND_EVAP_NH3_ATC-1708E_WEST               0.979
## equipment_typeCONV_CBL_EMPTY_CAN                         0.968
## equipment_typeFILTERING_SYS_RO                           0.968
## equipment_typeGENIE SIZZOR LIFT GS-2632                  0.995
## equipment_typeHVAC_EXH_FAN_EF1                           0.940
## equipment_typeINJ_SYS_NIT                                0.977
## equipment_typeINTERNAL ROLL UP DOOR                      0.969
## equipment_typeL02 PALLET LABEL APPLICATOR                0.983
## equipment_typeL04 KRONES PALLETIZER                      0.972
## equipment_typeL1 / L2 PALLETIZER_ELECPNEU                0.987
## equipment_typeL1 DETECTOR_FILL_VISION_PRESSCO FHCP3X     0.976
## equipment_typeL1 EAST FILTEC                             0.969
## equipment_typeL1 PALLET LABELER                          0.982
## equipment_typeL2 ALLIANCE-CCF203 SPIRAL INCLINE          0.975
## equipment_typeL2 WULFTEC STRETCH WRAPPER                 0.978
## equipment_typeL3 INTERLOX_ CASE_DIVERTER_S7050           0.993
## equipment_typeL3 PACKER_PLASRING_HICONE                  0.979
## equipment_typeL4 NITROGEN_DOSER_CRYODOSER 2K             0.997
## equipment_typeLAB EQUIP_VIDEO_SEAM                       0.986
## equipment_typeLIFT_EQUIP_CARTON                          0.953
## equipment_typeMICROBLEND MICRO2 BLENDER (LINE 1)         0.960
## equipment_typeNIBTORQUE - AIR TORQUE WRENCH              0.964
## equipment_typeNORTH NATURAL GAS COMPRESSOR               0.964
## equipment_typePERMEATE TANK UV SYSTEM                    0.954
## equipment_typePUMP_CENT_HOT_WATER_BUILDING               0.979
## equipment_typePUMP_DIA_CO2_CYLINDER_1                    0.942
## equipment_typePUMP_HYD_CARDBOAR_COMPACTOR                0.993
## equipment_typeROGERS AIR COMPRESSOR (COMP #1)            0.922
## equipment_typeSELLERS BOILER #2                          0.932
## equipment_typeSUGAR UV SANITATION                        0.935
## equipment_typeSULLAIR 200 HP COMP.                       0.956
## equipment_typeSWTCHGR_PANEL_IO_MAIN_CONTROL              0.971
## equipment_typeTANK_SS_STOR_PH_NEUTRALIZATION_#2_5000G    0.995
## equipment_typeTANK_SS_STOR_SANTIZING_2                   0.938
## equipment_typeTANK_SS_STOR_TANK_3                        0.971
## equipment_typeTANK_STL_STOR_AIR                          0.975
## equipment_typeWEST SYRUP ROOM COMPUTERS                  0.982
## equipment_typeWRAPPER_STRETCH_TURNTBL_SM004036           0.938
## plant_idG261                                             0.997
## plant_idG291                                             0.984
## plant_idG811                                             0.992
## plant_idG812                                             0.981
## plant_idG816                                             0.998
## 
##                                                        exp(coef) exp(-coef)
## equipment_typeCIP #1 PUMP - P901                       1.364e-75  7.332e+74
## equipment_typeCO2 ANALYZER MAX-BEV REMOTE TRUCK PANEL  4.206e+22  2.377e-23
## equipment_typeCODER_INKJET_LEIBINGER_JET3              2.210e+15  4.524e-16
## equipment_typeCODER-LINE 1 FULL CAN EAST               6.336e-78  1.578e+77
## equipment_typeCOND_EVAP_NH3_ATC-1708E_WEST             9.146e+39  1.093e-40
## equipment_typeCONV_CBL_EMPTY_CAN                       1.198e+96  8.347e-97
## equipment_typeFILTERING_SYS_RO                         1.377e+73  7.264e-74
## equipment_typeGENIE SIZZOR LIFT GS-2632                2.138e+08  4.677e-09
## equipment_typeHVAC_EXH_FAN_EF1                        3.315e-115 3.016e+114
## equipment_typeINJ_SYS_NIT                              5.810e-33  1.721e+32
## equipment_typeINTERNAL ROLL UP DOOR                    4.704e+74  2.126e-75
## equipment_typeL02 PALLET LABEL APPLICATOR             6.263e+125 1.597e-126
## equipment_typeL04 KRONES PALLETIZER                    1.204e+93  8.309e-94
## equipment_typeL1 / L2 PALLETIZER_ELECPNEU             2.293e+145 4.362e-146
## equipment_typeL1 DETECTOR_FILL_VISION_PRESSCO FHCP3X  4.567e+170 2.190e-171
## equipment_typeL1 EAST FILTEC                           2.067e+82  4.837e-83
## equipment_typeL1 PALLET LABELER                       5.889e+100 1.698e-101
## equipment_typeL2 ALLIANCE-CCF203 SPIRAL INCLINE       7.587e+113 1.318e-114
## equipment_typeL2 WULFTEC STRETCH WRAPPER               2.818e+84  3.549e-85
## equipment_typeL3 INTERLOX_ CASE_DIVERTER_S7050        1.062e+141 9.413e-142
## equipment_typeL3 PACKER_PLASRING_HICONE               1.577e+148 6.340e-149
## equipment_typeL4 NITROGEN_DOSER_CRYODOSER 2K          4.632e+136 2.159e-137
## equipment_typeLAB EQUIP_VIDEO_SEAM                     2.279e-21  4.388e+20
## equipment_typeLIFT_EQUIP_CARTON                       3.002e-130 3.332e+129
## equipment_typeMICROBLEND MICRO2 BLENDER (LINE 1)       1.208e-56  8.280e+55
## equipment_typeNIBTORQUE - AIR TORQUE WRENCH            7.554e-50  1.324e+49
## equipment_typeNORTH NATURAL GAS COMPRESSOR             1.697e-49  5.893e+48
## equipment_typePERMEATE TANK UV SYSTEM                  1.067e-63  9.375e+62
## equipment_typePUMP_CENT_HOT_WATER_BUILDING             2.464e+37  4.059e-38
## equipment_typePUMP_DIA_CO2_CYLINDER_1                  3.540e-86  2.825e+85
## equipment_typePUMP_HYD_CARDBOAR_COMPACTOR              5.808e+10  1.722e-11
## equipment_typeROGERS AIR COMPRESSOR (COMP #1)         9.412e-121 1.063e+120
## equipment_typeSELLERS BOILER #2                       4.836e-150 2.068e+149
## equipment_typeSUGAR UV SANITATION                     6.755e-114 1.480e+113
## equipment_typeSULLAIR 200 HP COMP.                     2.211e-61  4.523e+60
## equipment_typeSWTCHGR_PANEL_IO_MAIN_CONTROL           3.970e-138 2.519e+137
## equipment_typeTANK_SS_STOR_PH_NEUTRALIZATION_#2_5000G  1.487e+07  6.726e-08
## equipment_typeTANK_SS_STOR_SANTIZING_2                 7.632e-90  1.310e+89
## equipment_typeTANK_SS_STOR_TANK_3                      1.523e+58  6.565e-59
## equipment_typeTANK_STL_STOR_AIR                        1.392e+52  7.186e-53
## equipment_typeWEST SYRUP ROOM COMPUTERS                7.725e-26  1.295e+25
## equipment_typeWRAPPER_STRETCH_TURNTBL_SM004036        1.384e-100  7.225e+99
## plant_idG261                                           4.647e+02  2.152e-03
## plant_idG291                                           1.507e-11  6.635e+10
## plant_idG811                                           1.639e-05  6.101e+04
## plant_idG812                                           3.707e+16  2.698e-17
## plant_idG816                                           7.952e+16  1.258e-17
##                                                       lower .95 upper .95
## equipment_typeCIP #1 PUMP - P901                              0       Inf
## equipment_typeCO2 ANALYZER MAX-BEV REMOTE TRUCK PANEL         0       Inf
## equipment_typeCODER_INKJET_LEIBINGER_JET3                     0       Inf
## equipment_typeCODER-LINE 1 FULL CAN EAST                      0       Inf
## equipment_typeCOND_EVAP_NH3_ATC-1708E_WEST                    0       Inf
## equipment_typeCONV_CBL_EMPTY_CAN                              0       Inf
## equipment_typeFILTERING_SYS_RO                                0       Inf
## equipment_typeGENIE SIZZOR LIFT GS-2632                       0       Inf
## equipment_typeHVAC_EXH_FAN_EF1                                0       Inf
## equipment_typeINJ_SYS_NIT                                     0       Inf
## equipment_typeINTERNAL ROLL UP DOOR                           0       Inf
## equipment_typeL02 PALLET LABEL APPLICATOR                     0       Inf
## equipment_typeL04 KRONES PALLETIZER                           0       Inf
## equipment_typeL1 / L2 PALLETIZER_ELECPNEU                     0       Inf
## equipment_typeL1 DETECTOR_FILL_VISION_PRESSCO FHCP3X          0       Inf
## equipment_typeL1 EAST FILTEC                                  0       Inf
## equipment_typeL1 PALLET LABELER                               0       Inf
## equipment_typeL2 ALLIANCE-CCF203 SPIRAL INCLINE               0       Inf
## equipment_typeL2 WULFTEC STRETCH WRAPPER                      0       Inf
## equipment_typeL3 INTERLOX_ CASE_DIVERTER_S7050                0       Inf
## equipment_typeL3 PACKER_PLASRING_HICONE                       0       Inf
## equipment_typeL4 NITROGEN_DOSER_CRYODOSER 2K                  0       Inf
## equipment_typeLAB EQUIP_VIDEO_SEAM                            0       Inf
## equipment_typeLIFT_EQUIP_CARTON                               0       Inf
## equipment_typeMICROBLEND MICRO2 BLENDER (LINE 1)              0       Inf
## equipment_typeNIBTORQUE - AIR TORQUE WRENCH                   0       Inf
## equipment_typeNORTH NATURAL GAS COMPRESSOR                    0       Inf
## equipment_typePERMEATE TANK UV SYSTEM                         0       Inf
## equipment_typePUMP_CENT_HOT_WATER_BUILDING                    0       Inf
## equipment_typePUMP_DIA_CO2_CYLINDER_1                         0       Inf
## equipment_typePUMP_HYD_CARDBOAR_COMPACTOR                     0       Inf
## equipment_typeROGERS AIR COMPRESSOR (COMP #1)                 0       Inf
## equipment_typeSELLERS BOILER #2                               0       Inf
## equipment_typeSUGAR UV SANITATION                             0       Inf
## equipment_typeSULLAIR 200 HP COMP.                            0       Inf
## equipment_typeSWTCHGR_PANEL_IO_MAIN_CONTROL                   0       Inf
## equipment_typeTANK_SS_STOR_PH_NEUTRALIZATION_#2_5000G         0       Inf
## equipment_typeTANK_SS_STOR_SANTIZING_2                        0       Inf
## equipment_typeTANK_SS_STOR_TANK_3                             0       Inf
## equipment_typeTANK_STL_STOR_AIR                               0       Inf
## equipment_typeWEST SYRUP ROOM COMPUTERS                       0       Inf
## equipment_typeWRAPPER_STRETCH_TURNTBL_SM004036                0       Inf
## plant_idG261                                                  0       Inf
## plant_idG291                                                  0       Inf
## plant_idG811                                                  0       Inf
## plant_idG812                                                  0       Inf
## plant_idG816                                                  0       Inf
## 
## Concordance= 1  (se = 0 )
## Likelihood ratio test= 243.1  on 47 df,   p=<2e-16
## Wald test            = 0.09  on 47 df,   p=1
## Score (logrank) test = 158.7  on 47 df,   p=5e-14

Model Summary

The model is fitted using equipment_type and plant_id as predictors for the survival outcome. The model used 43 observations with 43 events (failures), while 24 observations were excluded due to missing data.

Likelihood Ratio, Wald, and Score Tests: The Likelihood Ratio Test (p < 2e-16) and the Score Test (p = 5e-14) are statistically significant, indicating that at least one coefficient in the model is significantly associated with the survival time. However, the Wald Test (p = 1) suggests that the Wald statistic for this model doesn’t provide evidence of a global effect of the covariates, likely due to the large standard errors.

Coefficients (coef):

These values represent the log hazard ratios for each equipment_type and plant_id compared to a baseline. A negative coefficient suggests a protective effect (reducing hazard or risk of failure), while a positive coefficient indicates an increased hazard. Some coefficients (e.g., equipment_typeCIP #1 PUMP - P901) have extreme values, which could indicate sparse data for these types, leading to large or unstable estimates.

Hazard Ratios (exp(coef)):

The hazard ratios (HRs) are the exponentiated coefficients (exp(coef)). For example, an HR of 4.548e-88 indicates an extremely low risk, while an HR of 2.463e+11 indicates an extremely high risk. However, the interpretability of these extreme values is limited, as they are likely the result of very small sample sizes or limited events for those categories.

Generally: HR < 1: Suggests the equipment or plant has a lower hazard (decreased failure risk). HR > 1: Suggests the equipment or plant has a higher hazard (increased failure risk). Some of these values are so large that they might be unreliable due to overfitting or multicollinearity.

Statistical Significance (Pr(>|z|)):

Most predictors have high p-values (close to or above 0.05), indicating a lack of statistical significance. This is likely due to the large standard errors and may suggest limited or imprecise information in the data for certain equipment_type and plant_id categories.

In summary, while the Cox model output provides initial insights into which equipment types and plants are associated with higher or lower hazards, the instability of estimates suggests that refining the model (e.g., by grouping or using regularization) would improve interpretability and reliability.

Check Proportional Hazards Assumption

# Test proportional hazards assumption
cox_zph <- cox.zph(cox_model)
print(cox_zph)

##                  chisq df      p
## equipment_type 997.954 42 <2e-16
## plant_id        -2.726  5      1
## GLOBAL          -0.777 47      1

# Plot Schoenfeld residuals
ggcoxzph(cox_zph)

## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values

The cox.zph(cox_model) function conducts a test of the proportional hazards assumption for each covariate in the model as well as for the global model.

p-value: A low p-value (typically < 0.05) indicates a violation of the proportional hazards assumption for that variable, suggesting the effect of the variable changes over time. A high p-value (> 0.05) suggests the proportional hazards assumption is reasonably met for that variable.

It can be seen above that the Cox model assumptions are satisfied by equipment_type and the estimated hazard ratios are valid as constants over time for it. On the other hand these assumptions are not satisfied for plant_id and for the global model.

Time Series Analysis of Failures

# Number of failures over time
failure_over_time <- data_sa %>%
  filter(failure == 1) %>%
  group_by(EXECUTION_START_DATE) %>%
  summarise(failures = n())

# Plot
ggplot(failure_over_time, aes(x = EXECUTION_START_DATE, y = failures)) +
  geom_line(color = "firebrick") +
  theme_minimal() +
  labs(title = "Failures Over Time",
       x = "Date", y = "Number of Failures")

Conclusion

This analysis provides valuable insights into equipment durability across different functional areas and equipment types. The Kaplan-Meier curves show variations in survival probabilities over time, indicating which equipment types and specific IDs have shorter or longer lifespans. While our model’s p-value (0.056) was slightly above the typical threshold for statistical significance (0.05), the results still offer practical information for targeting maintenance efforts. The Cox model further suggests that certain equipment types and plant locations may be linked to higher failure risks, though not all findings were statistically significant. Overall, this analysis can help prioritize equipment maintenance, improve planning, and reduce production interruptions.

Quarterly Predictive Model

The EDA observations suggest a model that incorporates functional area, quarterly breakdown trends, and historical breakdowns to capture patterns in maintenance needs. Lagged features (variables that capture past values of a feature at a specific time interval to help predict future outcomes), moving averages (calculated by averaging a set of past values over a specific time window, smoothing out short-term fluctuations to reveal longer-term trends in the data.), and cumulative breakdown totals over the past quarters could enhance predictive power by reflecting both immediate and long-term maintenance demands. This approach will enable the model to identify high-risk periods and allocate resources efficiently across quarters and functional areas.

Define Target Features

# Define the target variable
target <- "records"

# Define features to include all columns except the target
features <- setdiff(names(data_model), target)

Split data into Training and Testing Sets

# Split the data into training and testing sets (70/30 split)
set.seed(123)
train_indices <- sample(seq_len(nrow(data_model)), size = 0.7 * nrow(data_model))
train_data <- data_model[train_indices, c(target, features)]
test_data <- data_model[-train_indices, c(target, features)]

Train Random Forest

# Train the Random Forest model
set.seed(123)
rf_model <- randomForest(
  as.formula(paste(target, "~ .")), 
  data = train_data, 
  ntree = 100,
  mtry = floor(sqrt(length(features))),
  importance = TRUE
)
# Print basic summary of the model
print(rf_model)

## 
## Call:
##  randomForest(formula = as.formula(paste(target, "~ .")), data = train_data,      ntree = 100, mtry = floor(sqrt(length(features))), importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 2162022
##                     % Var explained: 98.34

The Random Forest model is a regression model trained with 100 trees, each split using 3 randomly selected features. The Mean of squared residuals (error measure) is 1,411,690, indicating the average squared difference between predicted and actual values. With 98.04% of variance explained, the model fits the data well, capturing nearly all variability in the target variable (records).

Evaluate Performance

# Predict on the test data
predictions <- predict(rf_model, newdata = test_data)

# Calculate performance metrics
mae <- mean(abs(predictions - test_data$records))
rmse <- sqrt(mean((predictions - test_data$records)^2))
r_squared <- cor(predictions, test_data$records)^2

# Print metrics
cat("Mean Absolute Error (MAE):", mae, "\n")

## Mean Absolute Error (MAE): 755.6077

cat("Root Mean Squared Error (RMSE):", rmse, "\n")

## Root Mean Squared Error (RMSE): 1467.443

cat("R-squared:", r_squared, "\n")

## R-squared: 0.9919079

This model demonstrates strong predictive performance based on the provided metrics. breakdown

MAE

# Calculate average and median breakdown counts per quarter
avg_quarterly_count <- mean(data_model$records)
median_quarterly_count <- median(data_model$records)

cat("Average Quarterly Breakdown Count:", avg_quarterly_count, "\n")

## Average Quarterly Breakdown Count: 6547.083

cat("Median Quarterly Breakdown Count:", median_quarterly_count, "\n")

## Median Quarterly Breakdown Count: 1851.5

# MAE % of average and median counts 
mae <- 617.5
mae_percentage_avg <- (mae / avg_quarterly_count) * 100
mae_percentage_median <- (mae / median_quarterly_count) * 100

cat("MAE as % of Average Quarterly Count:", mae_percentage_avg, "%\n")

## MAE as % of Average Quarterly Count: 9.431682 %

cat("MAE as % of Median Quarterly Count:", mae_percentage_median, "%\n")

## MAE as % of Median Quarterly Count: 33.35134 %

MAE as a Percentage of the Average Quarterly Count (12.84%): This suggests the model’s errors are reasonably low relative to the average breakdown count, indicating good precision for the typical breakdown volume.

MAE as a Percentage of the Median Quarterly Count (48%): This higher percentage shows that while the model performs well on average, it might be less accurate for quarters with lower breakdown counts. This could mean the model is slightly overestimating breakdowns for quarters that experience fewer breakdowns, pulling the MAE up relative to the median.

Overall, with an MAE of around 12.8% of the average quarterly count, the model is quite effective at predicting breakdowns, especially when they are closer to average levels. However, if more precision is needed for lower breakdown quarters, further tuning or a complementary model focusing on those lower counts might improve performance.

RMSE

rmse <- 1318.3

# Calculate RMSE as a percentage of the average and median quarterly breakdown count
rmse_percentage_avg <- (rmse / avg_quarterly_count) * 100
rmse_percentage_median <- (rmse / median_quarterly_count) * 100

cat("RMSE as % of Average Quarterly Count:", rmse_percentage_avg, "%\n")

## RMSE as % of Average Quarterly Count: 20.13569 %

cat("RMSE as % of Median Quarterly Count:", rmse_percentage_median, "%\n")

## RMSE as % of Median Quarterly Count: 71.20173 %

RMSE as a Percentage of Average Quarterly Count (27.4%): This suggests that the model’s larger prediction errors are around a quarter of the average quarterly breakdown count, which is acceptable but might indicate room for improvement, especially if high precision is important for decision-making.
RMSE as a Percentage of Median Quarterly Count (102.5%): This percentage is high, meaning the model’s larger errors are roughly equivalent to or even greater than typical lower breakdown quarters. This suggests that while the model captures overall patterns well, it tends to overestimate or experience larger deviations for quarters with lower breakdown counts.

In summary, while the model performs reasonably well in general (especially on average-sized quarters), the high RMSE relative to the median count suggests it may struggle with quarters that have fewer breakdowns. If predictions for lower-count quarters are important, additional tuning or adjustments to the model may improve its accuracy.

Feature Importance

# Feature importance
importance_visual <- importance(rf_model)
importance_visual

##                                     %IncMSE IncNodePurity
## FUNCTIONAL_AREA_NODE_1_MODIFIED    4.454205    2778889245
## execution_year                     2.167372      33820110
## execution_quarter                 -0.831163       7839232
## avg_repair_duration                2.255188      44766892
## unplanned_maintenance_prop         2.114415     519615604
## breakdowns_last_quarter            5.961531    4772372694
## breakdowns_same_quarter_last_year  2.701553     811017153
## avg_breakdowns_last_4_quarters     6.069149    4484877204
## total_breakdowns_last_4_quarters   6.605011    5551170554

feature_importance <- varImpPlot(rf_model, main = "Feature Importance")

feature_importance

##                                     %IncMSE IncNodePurity
## FUNCTIONAL_AREA_NODE_1_MODIFIED    4.454205    2778889245
## execution_year                     2.167372      33820110
## execution_quarter                 -0.831163       7839232
## avg_repair_duration                2.255188      44766892
## unplanned_maintenance_prop         2.114415     519615604
## breakdowns_last_quarter            5.961531    4772372694
## breakdowns_same_quarter_last_year  2.701553     811017153
## avg_breakdowns_last_4_quarters     6.069149    4484877204
## total_breakdowns_last_4_quarters   6.605011    5551170554

Model 2

Log Transformation

# Create a separate copy for the new model transformations
data_model_transformed <- data_model

# Apply transformations on the new copy
data_model_transformed$log_records <- log1p(data_model_transformed$records)  # log1p handles zero values safely

Interaction term

data_model_transformed$interaction_repair_unplanned <- data_model_transformed$avg_repair_duration * data_model_transformed$unplanned_maintenance_prop

Polynomial

data_model_transformed$breakdowns_last_quarter_squared <- data_model_transformed$breakdowns_last_quarter^2

Features

# Define the new target variable and feature set
new_target <- "log_records"
new_features <- c(
  "FUNCTIONAL_AREA_NODE_1_MODIFIED", "avg_repair_duration", "unplanned_maintenance_prop",
  "breakdowns_last_quarter", "breakdowns_same_quarter_last_year",
  "avg_breakdowns_last_4_quarters", "total_breakdowns_last_4_quarters",
  "interaction_repair_unplanned", "breakdowns_last_quarter_squared"
)

Split Data

# Split the transformed data into training and testing sets (70/30 split)
set.seed(123)
train_indices <- sample(seq_len(nrow(data_model_transformed)), size = 0.7 * nrow(data_model_transformed))
train_data_new <- data_model_transformed[train_indices, c(new_target, new_features)]
test_data_new <- data_model_transformed[-train_indices, c(new_target, new_features)]

Models

# Train the rf model with the new target and features
set.seed(123)
rf_model_new <- randomForest(
  as.formula(paste(new_target, "~ .")), 
  data = train_data_new, 
  ntree = 100,                       
  mtry = floor(sqrt(length(new_features))),  
  importance = TRUE                 
)

Predictions

# Predict on the test set
log_predictions <- predict(rf_model_new, newdata = test_data_new)

# Transform predictions back to original scale
predictions <- expm1(log_predictions)

Performance

# Reconfirm test indices
test_indices <- setdiff(seq_len(nrow(data_model_transformed)), train_indices)

# Extract records from the original dataset for these indices
original_records <- data_model$records[test_indices]

# Double-check lengths
cat("Length of predictions:", length(predictions), "\n")

## Length of predictions: 66

cat("Length of original_records:", length(original_records), "\n")

## Length of original_records: 66

# Calculate performance metrics on the original scale
mae_new <- mean(abs(predictions - original_records))
rmse_new <- sqrt(mean((predictions - original_records)^2))
r_squared_new <- cor(predictions, original_records)^2

# Print the results
cat("New Model Mean Absolute Error (MAE):", mae_new, "\n")

## New Model Mean Absolute Error (MAE): 969.5899

cat("New Model Root Mean Squared Error (RMSE):", rmse_new, "\n")

## New Model Root Mean Squared Error (RMSE): 2180.78

cat("New Model R-squared:", r_squared_new, "\n")

## New Model R-squared: 0.978927

MAE

# Calculate MAE percentage for the new model
mae_percentage_avg_new <- (mae_new / avg_quarterly_count) * 100
mae_percentage_median_new <- (mae_new / median_quarterly_count) * 100

# Print results
cat("New Model MAE as % of Average Quarterly Count:", mae_percentage_avg_new, "%\n")

## New Model MAE as % of Average Quarterly Count: 14.8095 %

cat("New Model MAE as % of Median Quarterly Count:", mae_percentage_median_new, "%\n")

## New Model MAE as % of Median Quarterly Count: 52.3678 %

The new model’s Mean Absolute Error (MAE) is approximately 15.6% of the average quarterly breakdown count and 58.3% of the median quarterly breakdown count. This indicates that while the model performs reasonably well on average, its errors are relatively high for quarters with fewer breakdowns, suggesting it still may struggle with accurately predicting low-count quarters.

Feature Importance

# Calculate feature importance
importance_values <- importance(rf_model_new)

# Print the importance values
print(importance_values)

##                                    %IncMSE IncNodePurity
## FUNCTIONAL_AREA_NODE_1_MODIFIED   6.297976     97.943519
## avg_repair_duration               1.740604      3.463234
## unplanned_maintenance_prop        5.299039     45.208169
## breakdowns_last_quarter           6.731553    127.326773
## breakdowns_same_quarter_last_year 2.542531     20.548113
## avg_breakdowns_last_4_quarters    4.045798     68.626205
## total_breakdowns_last_4_quarters  4.030869     65.416587
## interaction_repair_unplanned      5.015716     15.163464
## breakdowns_last_quarter_squared   6.975086    148.716531

# Plot feature importance
varImpPlot(rf_model_new, main = "Feature Importance for New Random Forest Model", 
           n.var = min(10, nrow(importance(rf_model_new))),   # Show top 10 features if too many
           cex = 0.7,                                        # Adjust text size if needed
           las = 2)                                          # Make y-axis labels vertical

The Feature Importance Plot from the Random Forest model shows which variables have the most influence on predicting quarterly breakdown counts.

Top Features:

breakdowns_last_quarter and breakdowns_last_quarter_squared: These variables are the most important, indicating that recent breakdown counts (especially the previous quarter) have a strong predictive impact on future breakdowns. FUNCTIONAL_AREA_NODE_1_MODIFIED: This feature, which likely represents the functional area or department, is also highly influential. This suggests that the likelihood of breakdowns varies significantly across different functional areas. Other Influential Features:

avg_breakdowns_last_4_quarters and total_breakdowns_last_4_quarters: These features, representing breakdown patterns over the last year, also contribute to the prediction. They show that past trends over several quarters help in forecasting future breakdowns. unplanned_maintenance_prop: The proportion of unplanned maintenance events affects breakdown counts, implying that areas with more unplanned maintenance events are more likely to experience breakdowns. Less Impactful Features:

Features like avg_repair_duration and breakdowns_same_quarter_last_year have lower importance. While they provide some predictive value, they are not as influential as the more recent breakdown history or functional area. In summary, the model emphasizes recent breakdown counts and functional area as the strongest indicators for predicting quarterly breakdown counts. This insight can guide maintenance scheduling by prioritizing areas with high recent breakdown counts and tailoring strategies for different functional areas.

Feature Importance By Quarter

# Loop over each quarter, train a model, and capture feature importance
quarterly_importance <- list()

for (q in 1:4) {
  # Filter data for each quarter
  data_q <- data_model_transformed %>% filter(execution_quarter == q)
  
  # Train a Random Forest model on the data for this quarter
  rf_q <- randomForest(
    as.formula(paste(new_target, "~ .")), 
    data = data_q, 
    ntree = 100,
    mtry = floor(sqrt(length(new_features))),
    importance = TRUE
  )
  
  # Get feature importance and store it with quarter information
  importance_q <- as.data.frame(importance(rf_q))
  importance_q$Feature <- rownames(importance_q)
  importance_q$Quarter <- q  
  
  quarterly_importance[[q]] <- importance_q
}

# Combine importance data for all quarters
quarterly_importance_df <- do.call(rbind, quarterly_importance)

# Convert Quarter to a factor to make it a discrete variable for plotting
quarterly_importance_df$Quarter <- as.factor(quarterly_importance_df$Quarter)

# Filter out the target variable 'records' from the importance data frame, if present
quarterly_importance_df <- quarterly_importance_df %>%
  filter(Feature != "records")

# Calculate the average importance of each feature across quarters
avg_importance <- quarterly_importance_df %>%
  group_by(Feature) %>%
  summarize(mean_importance = mean(`%IncMSE`, na.rm = TRUE))

# Merge back to order by mean importance and reorder features
quarterly_importance_df <- quarterly_importance_df %>%
  left_join(avg_importance, by = "Feature") %>%
  mutate(Feature = reorder(Feature, -mean_importance))

# Create the plot with facet_wrap for each quarter
ggplot(quarterly_importance_df, aes(x = reorder(Feature, `%IncMSE`), y = `%IncMSE`, fill = as.factor(Quarter))) +
  geom_bar(stat = "identity", position = "dodge", color = "black") +
  coord_flip() +
  labs(
    title = "Feature Importance by Quarter",
    x = "Feature",
    y = "% Increase in MSE",
    fill = "Quarter"
  ) +
  facet_wrap(~ Quarter, ncol = 2) +  # Facet by Quarter with 2 columns
  theme_minimal() +
  scale_fill_brewer(palette = "Set3") +
  theme(
    axis.text.y = element_text(size = 10, hjust = 1),
    axis.text.x = element_text(size = 10),
    legend.position = "none",
    panel.grid.major.x = element_line(color = "grey", size = 0.2)
  )

## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Interpretation

Top Features Consistent Across Quarters: The features FUNCTIONAL_AREA_NODE_1_MODIFIED, breakdowns_last_quarter_squared, and breakdowns_last_quarter consistently show high importance across all quarters, indicating that historical breakdown data and the specific functional area are strong predictors of future breakdowns.

Seasonal Variation in Importance: Some features show fluctuating importance across quarters. For instance, unplanned_maintenance_prop is more important in Q1 and Q4, while interaction_repair_unplanned varies throughout the year. This suggests that certain maintenance patterns or reactive behaviors impact breakdowns differently depending on the time of year.

Lower Importance of Time-Based Features: execution_year and execution_quarter show low importance across all quarters, indicating that breakdown patterns are not directly tied to the specific year or quarter. This suggests that seasonal trends alone do not fully explain the variation in breakdowns.

Relevance to Stakeholders

For stakeholders, the model demonstrates that:

Historical Data Drives Predictions: The most influential features relate to past breakdowns, supporting a proactive maintenance approach. Knowing that breakdowns in the last quarter are highly predictive of current issues may encourage regular, data-informed check-ups.

Functional Area Significance: The functional area of equipment is consistently crucial, suggesting that breakdown likelihood varies significantly by area. This insight can guide resource allocation to high-risk areas.

Quarter-Specific Maintenance Strategy: With some features showing seasonal shifts, stakeholders may adopt flexible maintenance practices that align with quarterly variations. For example, increased focus on unplanned maintenance in Q1 and Q4 could reduce unexpected breakdowns.

Overall Value

While this visualization reinforces the predictive power of certain features, it provides limited actionable insights specifically tied to quarters. To enhance its value, stakeholders might focus on quarterly maintenance planning for high-risk areas based on this model’s insights. However, if a more nuanced understanding of seasonal trends is required, additional analysis or a deeper focus on quarterly-specific models might be beneficial.

Work Time Model

The goal of this model is to predict ACTUAL_WORK_IN_MINUTES using the other input variables. If accurate, this model would allow Swire to forecast how long work orders will take to address and predict which ones will cause the longest disruptions to its production. Based on the size and structure of the dataset and our limited time and computing power, we are somewhat limited on what types of regression algorithms we can use. We will start with a simple linear regression to set a benchmark for performance. We will then use a random forest to see if that model performs any better with the same variables. Finally we will build a regression tree that uses the text data variables that were generated in the data preparation section. Models will be evaluated with simple hold out cross validation.

# Set seed for reproducibility
set.seed(123)

# Split the data into training and testing sets
trainIndex <- createDataPartition(final_data$ACTUAL_WORK_IN_MINUTES, p = .7, 
                                  list = FALSE, 
                                  times = 1)

train_data <- final_data[trainIndex, ]
test_data <- final_data[-trainIndex, ]

Model 1: Linear Regression

# build model
lm_model <- glm(ACTUAL_WORK_IN_MINUTES ~ PLANT_ID + PRODUCTION_LOCATION + EXECUTION_START_DATE + ACTUAL_START_TIME + MAINTENANCE_TYPE_DESCRIPTION, data = train_data)

# Predict on the training data
train_predictions <- predict(lm_model, newdata = train_data)

# Predict on the test data
test_predictions <- predict(lm_model, newdata = test_data)

# Train RMSE
rmse(train_data$ACTUAL_WORK_IN_MINUTES, train_predictions)

## [1] 297.1823

# Test RMSE
rmse(test_data$ACTUAL_WORK_IN_MINUTES, test_predictions)

## [1] 183.3298

# Train MAE
mae(train_data$ACTUAL_WORK_IN_MINUTES, train_predictions)

## [1] 71.70014

# Test MAE
mae(test_data$ACTUAL_WORK_IN_MINUTES, test_predictions)

## [1] 70.8909

Model 2: Random Forest

# Fit Random Forest model
rf_model <- randomForest(ACTUAL_WORK_IN_MINUTES ~ PLANT_ID + PRODUCTION_LOCATION + EXECUTION_START_DATE + ACTUAL_START_TIME + MAINTENANCE_TYPE_DESCRIPTION, data = train_data, ntree = 50)

# Predict on the training data
train_predictions <- predict(rf_model, newdata = train_data)

# Predict on the test data
test_predictions <- predict(rf_model, newdata = test_data)

# Train RMSE
rmse(train_data$ACTUAL_WORK_IN_MINUTES, train_predictions)

## [1] 292.8442

# Test RMSE
rmse(test_data$ACTUAL_WORK_IN_MINUTES, test_predictions)

## [1] 181.6144

# Train MAE
mae(train_data$ACTUAL_WORK_IN_MINUTES, train_predictions)

## [1] 68.32546

# Test MAE
mae(test_data$ACTUAL_WORK_IN_MINUTES, test_predictions)

## [1] 68.04513

Model 3: Regression Tree with Text Data

# regression tree model
tree_model <- rpart(ACTUAL_WORK_IN_MINUTES ~ ., data = train_data)

# Predict on the training data
train_predictions <- predict(tree_model, newdata = train_data)

# Predict on the test data
test_predictions <- predict(tree_model, newdata = test_data)

# Train RMSE
rmse(train_data$ACTUAL_WORK_IN_MINUTES, train_predictions)

## [1] 283.0599

# Test RMSE
rmse(test_data$ACTUAL_WORK_IN_MINUTES, test_predictions)

## [1] 201.2538

# Train MAE
mae(train_data$ACTUAL_WORK_IN_MINUTES, train_predictions)

## [1] 70.63337

# Test MAE
mae(test_data$ACTUAL_WORK_IN_MINUTES, test_predictions)

## [1] 71.04794

Results

In conclusion, while the three models—linear regression, random forest, and regression tree—provide a useful starting point for predicting ACTUAL_WORK_IN_MINUTES, their performance is somewhat limited. Although the random forest model outperforms the others, the overall accuracy remains relatively low, with test RMSE values in the 180-205 range and MAE between 67 and 72. These models offer value in identifying work orders that are likely to take longer to complete, but they may not be precise enough for detailed operational planning. One key limitation is the relatively small set of features used in the models, particularly the reliance on basic categorical and temporal variables. Additionally, the models may benefit from further feature engineering, such as incorporating more detailed historical data, maintenance patterns, or machine learning techniques like gradient boosting or neural networks. Future analysis could explore tuning the models for better performance, including optimizing hyperparameters or applying more advanced algorithms.

Group Member Contributions

This project was a collaborative effort. For the first few weeks of the assignment, we each worked individually on our own complete modeling notebook. In the final week before the due date, we combined and organized our results into one notebook. We didn’t assign individual tasks, but each of us decided to approach the business problem from a different angle. Ahsan worked on a survival analysis model. Scott and Charlie focused on modeling/forecasting the number of orders and total downtime. Rylan worked with the text data to try to model individual work order downtimes.

Group Modeling Assignment

Ahsan Ahmad, Scott Silverstein, Rylan Tribush, Charlie Vandel

2024-11-03

Introduction

Project Goal

Business and Analytic Problems

Purpose of the Notebook

Data Preparation

Data Cleaning for Survival Analysis

Data Cleaning for Quarterly Predictive Model

Data Cleaning for Work Order Time Model

Survival Analysis Model

Identify First Failures

Identify Censored Equipments

Combine Failure and Censored Data

Create Survival Object

Kaplan-Meier Estimator

Interpretation of Kaplan-Meier Survival Curve

Survival Analysis by Equipment Type

Survival Analysis by Equipment_IDs

Cox Proportional Hazards Model

Model Summary

Check Proportional Hazards Assumption

Time Series Analysis of Failures

Conclusion

Quarterly Predictive Model

Define Target Features

Split data into Training and Testing Sets

Train Random Forest

Evaluate Performance

MAE

RMSE

Feature Importance

Model 2

Log Transformation

Interaction term

Polynomial

Features

Split Data

Models

Predictions

Performance

MAE

Feature Importance

Feature Importance By Quarter

Interpretation

Relevance to Stakeholders

Overall Value

Work Time Model

Model 1: Linear Regression

Model 2: Random Forest

Model 3: Regression Tree with Text Data

Results

Group Member Contributions