Root cause analysis in manufacturing – what is it all about and why we should be using it more often
Root cause analysis in manufacturing is the process of finding the true cause of a failure, that is, the machine that actually caused the problem on the production line. If you want to increase efficiency in a factory, it is critical to analyze stoppages and find their true cause. However, this is a very time-consuming process, especially when a significant amount of data is produced every second. With increased automation and Industry 4.0, the sensorization of manufacturing environments has expanded, and so has the amount of data available. Many maintenance managers address only the symptoms of a problem without addressing its root, which leads to recurring issues that would otherwise be easy to solve. When a production line consists of multiple machines and the production steps are complex, it is important to be able to trace problems directly to their source. This approach is well known and widely used in LEAN manufacturing. Let us explain how we monitor production failures and find their root cause.
PackOS root-cause analysis engine
Finding the root cause of a line-level stop in a chain of machines requires deeper knowledge of the statuses of individual machines, but it brings very valuable insights to light.
We studied many different lines, machines and production processes in order to create a powerful root-cause analysis engine in PackOS. Here we share how it works and why it is so crucial for optimizing the production process. Below we describe the various downtime types of machines and how they affect the root cause search algorithms. For this mechanism to work properly, you need to make sure that the machine states in PackOS (such as Holdup, Idle or Failure) are classified correctly and reflect the real-life situation.
Every machine defines two properties for each stop:
- What is the stop on the machine itself
- What is the potential root cause
No root cause
First, note that the following states can neither have a root cause nor be the root cause:
- Work
- Changeover
- Off
- Inactive
- NoLiveData
If any such state is encountered, it is skipped and the search continues.
Internal root cause
The Failure state is usually identified directly by a sensor on the machine itself. It means that something prevents the machine from operating normally (e.g. a fault of some part). Such downtime becomes a potential root cause for this or other machines.
The Lack of Components state (also called Material Shortage) is usually identified directly by a sensor on the machine itself, and it does not trigger a search for an external root cause. This stop means that the machine cannot produce because some component (e.g. caps) is missing. Such downtime becomes a potential root cause for other machines.
The Stop by operator state is the result of a manual operator intervention in machine operation. It is usually done on purpose (e.g. to solve an ongoing failure). That is why, when PackOS spots a Failure followed immediately by a Stop by operator, the former becomes the root cause of the latter.
Note the different behavior at the line level:
It matters whether the line is waiting in a Failure or is in Stop by operator while the failure is being investigated, so the line shows both statuses at the line level instead of blindly following the root cause as in every other case.
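The Failure followed by Stop by operator pattern can be sketched as below. This is only an illustration, not the PackOS implementation: the `Stop` record, the function names and the `max_gap` tolerance (how strictly "immediately" is interpreted) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Stop:
    machine: str
    state: str    # e.g. "Failure", "Stop by operator"
    start: float  # seconds on a shared clock
    end: float


def operator_stop_cause(prev: Stop, current: Stop, max_gap: float = 0.0) -> Optional[Stop]:
    """Return `prev` when `current` is a Stop by operator that directly follows a Failure."""
    # `max_gap` is a hypothetical tolerance for what counts as "immediately".
    follows = 0.0 <= current.start - prev.end <= max_gap
    if (prev.machine == current.machine and prev.state == "Failure"
            and current.state == "Stop by operator" and follows):
        return prev
    return None


def line_level_states(prev: Stop, current: Stop) -> List[str]:
    """States shown on the line: both when the pattern matches, else only the current one."""
    cause = operator_stop_cause(prev, current)
    return [cause.state, current.state] if cause else [current.state]


failure = Stop("Filler", "Failure", start=0.0, end=40.0)
manual = Stop("Filler", "Stop by operator", start=40.0, end=300.0)
print(line_level_states(failure, manual))  # ['Failure', 'Stop by operator']
```

If the operator stop does not directly follow a Failure, the sketch falls back to showing the operator stop alone.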

External root cause
Holdup and Idle on a machine signify that the root cause is outside of the machine.
- Holdup – appears if the machine cannot operate because its output has been blocked (usually by queued products). It indicates that the root cause is downstream from the machine.
- Idle – appears if there is no input into the machine (e.g. no bottles at the entry of the filler). It indicates that the root cause is upstream from the machine.
Note the difference between ‘Idle’ and ‘Lack of Components’.
Idle looks for an external problem (on a different machine in the flow), while Lack of Components assumes that the problem lies in a direct infeed to the machine (not explicitly monitored as a separate machine) and does not look for an external root cause.
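The state categories described so far can be summarised in a small lookup. The state names come from this article; the function names and the returned labels are hypothetical and chosen only for this sketch.

```python
from typing import Optional

# States that can neither have nor be a root cause: skipped during the search.
SKIPPED = {"Work", "Changeover", "Off", "Inactive", "NoLiveData"}
# States identified on the machine itself: no external search is started.
INTERNAL = {"Failure", "Lack of Components", "Stop by operator"}
# States whose cause lies on another machine, with the search direction.
EXTERNAL = {"Holdup": "downstream", "Idle": "upstream"}


def search_direction(state: str) -> Optional[str]:
    """Direction in which to look for an external root cause, or None."""
    return EXTERNAL.get(state)


def classify(state: str) -> str:
    """Map a machine state to one of the three categories used in this article."""
    if state in SKIPPED:
        return "skipped"
    if state in INTERNAL:
        return "internal"
    if state in EXTERNAL:
        return "external"
    raise ValueError(f"unknown state: {state}")


print(classify("Idle"), search_direction("Idle"))  # external upstream
print(classify("Lack of Components"), search_direction("Lack of Components"))  # internal None
```

The last two lines show the Idle versus Lack of Components distinction: only Idle yields a search direction.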
There are two critical pieces of data for a successful root cause search:
1. Graph of connected machines
This is the active set of machines and the connections between them. Only the machines in the active flow are searched for the root cause, and the connections between them determine the direction of the search.
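One way to picture such a graph is a directed adjacency map in which every edge points in the direction of product flow. The machine names, the `FLOW` map and both helper functions below are made up for illustration; how PackOS actually stores the flow is not described here.

```python
from collections import deque
from typing import Dict, List

# Hypothetical active flow: each edge points in the direction of product flow.
FLOW: Dict[str, List[str]] = {
    "Depalletizer": ["Filler"],
    "Filler": ["Capper"],
    "Capper": ["Labeler"],
    "Labeler": ["Packer"],
    "Packer": [],
}


def reverse(flow: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Flip every edge so the graph can be walked upstream."""
    rev: Dict[str, List[str]] = {machine: [] for machine in flow}
    for src, targets in flow.items():
        for dst in targets:
            rev.setdefault(dst, []).append(src)
    return rev


def reachable(flow: Dict[str, List[str]], start: str, direction: str) -> List[str]:
    """Machines reachable from `start`, nearest first, in the given direction."""
    edges = flow if direction == "downstream" else reverse(flow)
    seen, order, queue = {start}, [], deque([start])
    while queue:
        machine = queue.popleft()
        for neighbour in edges.get(machine, []):
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


print(reachable(FLOW, "Capper", "downstream"))  # ['Labeler', 'Packer']
print(reachable(FLOW, "Capper", "upstream"))    # ['Filler', 'Depalletizer']
```

A machine that is missing from the map is never visited, which mirrors the rule that only machines in the active flow are searched.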
The flow can be controlled either by a SKU line configuration, which will adjust the flow after order start:

Or by a pre-processing function (see Pre-Processing Functions), which can trigger a SetFlow command.
Or by a manual adjustment in settings:
2. In-sync history of machine downtimes (root cause failure analysis)
Let’s analyse the root cause search for a Holdup. Idle works exactly the same, but in the other direction.
To find a root cause for a Holdup, we will analyse all stoppages for machines downstream, and look for stoppages which can be classified as root cause (described above). All the other states are “transparent”. There is no “bouncing” the other way: if we are looking for the cause of a holdup, we never look at the machines before the Base Availability machine even if one of the downstream machines is in the Idle state.

If there was no breakdown downstream at this point in time, the search for the Holdup would not find any root cause.
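At a single point in time, the Holdup search described above can be sketched as a filter over the downstream machines. The function name and the example line are assumptions. The set of cause states contains only the two states this article explicitly names as potential root causes for other machines; how Stop by operator on another machine is treated is not stated, so it is left out.

```python
from typing import Dict, List

# States named in this article as potential root causes for other machines.
CAUSE_STATES = {"Failure", "Lack of Components"}


def holdup_candidates(downstream: List[str], states: Dict[str, str]) -> List[str]:
    """Downstream machines whose current state can explain a Holdup.

    `downstream` lists the machines after the base machine in flow order.
    Every other state (Work, Holdup, Idle, ...) is transparent: it is passed
    over, and an Idle machine never redirects the search upstream.
    """
    return [m for m in downstream if states.get(m) in CAUSE_STATES]


# Hypothetical line: the Filler is in Holdup, so only machines after it are checked.
downstream_of_filler = ["Capper", "Labeler", "Packer"]

snapshot = {"Capper": "Holdup", "Labeler": "Idle", "Packer": "Failure"}
print(holdup_candidates(downstream_of_filler, snapshot))  # ['Packer']

no_breakdown = {"Capper": "Work", "Labeler": "Work", "Packer": "Work"}
print(holdup_candidates(downstream_of_filler, no_breakdown))  # []
```

The second call corresponds to the case with no breakdown: the list is empty, so the Holdup is left without a root cause.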

To account for full buffers and shorter reaction times, we consider the entire period that starts the maximum reaction time (x) before the stoppage began and ends at the moment the stoppage began, or even a while after it (y). The extra time y lets us catch a situation in which a driver did not “report” a breakdown as quickly as it should have:

You can define ‘X’ and ‘Y’ for each machine in machine settings:

The whole period (x+y) is analysed, and the first event in this period of time will take precedence and become a root cause:

In a string of many machines, it is vital to observe the relative delay of stoppages from the beginning of the observed reaction time (marked in pink +1s/+5s). Using that relative time, it is possible to indicate the stoppage that began first as the root cause because a stoppage that occurred relatively later could be a consequence of the problem rather than its cause.

The time that has passed starting from the set “reaction time” is more important than the order of machines:

Specifying buffer delays is especially important when working with long lines. In that case root causes can take some time to propagate through machines.
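The window logic can be sketched as follows. This is an interpretation rather than the PackOS algorithm: it assumes that x and y are taken from the settings of each candidate machine, that only stop start times are compared (stops that began before the window and are still ongoing are ignored), and that the smallest delay measured from the start of a machine's own window wins. The `Stop` and `Window` records and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Stop:
    machine: str
    state: str
    start: float  # seconds on a shared clock


@dataclass
class Window:
    x: float  # maximum reaction time: how far back before the stoppage to look
    y: float  # grace period after the stoppage began


CAUSE_STATES = {"Failure", "Lack of Components"}


def find_root_cause(stoppage_start: float, candidates: List[Stop],
                    windows: Dict[str, Window]) -> Optional[Stop]:
    """Pick the candidate with the smallest delay relative to its own window start."""
    best, best_delay = None, None
    for stop in candidates:
        if stop.state not in CAUSE_STATES:
            continue  # transparent state
        window = windows[stop.machine]
        window_start = stoppage_start - window.x
        window_end = stoppage_start + window.y
        if not (window_start <= stop.start <= window_end):
            continue  # outside the searched period (x + y)
        delay = stop.start - window_start
        if best_delay is None or delay < best_delay:
            best, best_delay = stop, delay
    return best


# A Holdup begins at t=100 s; the Packer is further away, so it gets a longer x.
windows = {"Capper": Window(x=10, y=2), "Packer": Window(x=30, y=2)}
stops = [Stop("Capper", "Failure", start=95), Stop("Packer", "Failure", start=71)]
cause = find_root_cause(100, stops, windows)
print(cause.machine)  # Packer (+1 s into its window, versus +5 s for the Capper)
```

In this example the Packer is chosen although the Capper is the nearer machine, because its stop falls earlier in its own window.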

You can see the searched period in the Work Spectrum view, marked with dark bars on neighbouring machines.
