
Wednesday, December 8, 2021

Improving the Predictive Model Selection Process Using Lean Tools and Methods

Introduction:

This investigation looks at the process of creating a linear regression model in RStudio for a random data set. The process name, mission, and definitions are identified, followed by a flowchart and metrics. Then, several Deming-based Lean Six Sigma tools are applied, namely 5S, Total Productive Maintenance, Quick Changeovers (SMED), and Mistake Proofing (Poka Yoke). Theoretical areas for improvement and the methodology to address them are then discussed, with the goal of optimizing the key CTQs: the cycle time, number of errors, and run time of the entire process.

I. Naming the process and describing its mission
Process Name: Identifying and applying a regression model to any data set with a continuous response variable
Process Mission: Finding a best-fit model by eliminating unnecessary steps, reducing the complexity of the process, and decreasing the amount of time needed to find such a model.

II. Mission of the process
Mission Statement: Improving my statistical analysis process in order to become a better business analyst

III. Flowchart and dashboard of the process’ objectives and metrics

Dashboard
Strategy: Lean Six Sigma Tools and Methods for Process Improvement

Objectives and Metrics

Objective: Reducing the time between accessing the data set and creating a best-fit model
Metric: Cycle time from start to end

Objective: Reducing the number of errors in the code
Metric: The number of times the program reports an error

Objective: Reducing the number of steps needed to obtain the optimal model
Metric: The number of lines of code needed to find the final model

Objective: Reducing the amount of time it takes to run all of the code
Metric: After completing the code, the number of seconds it takes to produce a document from the script

Model Selection Flowchart

IV. Operationally Define Each Metric

CTQ: Cycle Time
Definition: Cycle time is the amount of time between opening a clean data set and finding the best-fitting model for that data set.
Definition of a defect: For an efficient process it should take less than 2 hours to find a model for a data set with 20 variables, and every 10 additional variables add about 0.5 hours of cycle time. Optimal cycle time (in hours) is therefore given by the following formula:

Cycle Time ≤ 2 + (n – 20)*0.05,

where n is the number of variables in the data set. A cycle time that does not satisfy this inequality is a defect.
Opportunities for defects:
• Erroneous code
• Inefficient code

CTQ: Number of Errors
Definition: The number of times the output of a line of code says "error" (see Figure 1 below for an example).
Definition of a defect: For an efficient process the script should not include any errors; the code used to build a model should be applicable to any data set without error. The target is Number of Errors = 0, so any error is a defect in our process.
Opportunities for defects:
• Program needs an update
• Wi-Fi connection lost
• Typing errors

CTQ: Run Time
Definition: Run time is the amount of time it takes to run all of the lines of code, one after the other, to produce the best-fitting model.
Definition of a defect: For an efficient process the program should be able to run all of the code and produce an output in less than 1 minute, with an additional minute allowed for every additional 100,000 observations in the data set. Optimal run time (in minutes) is given by this formula:

Run Time ≤ 1 + (n – 100,000)*0.00001,

where n is the number of observations in the data set. A run time that does not satisfy this inequality is a defect.
Opportunities for defects:
• Computer memory
• Extra packages installed in the program
• Wi-Fi speed
• Inefficient code
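
As a concrete way to track the Run Time CTQ against the formula above, the elapsed time of the full script can be measured directly in R. The sketch below is illustrative only: the script name model_selection.R and the observation count are hypothetical placeholders.

```r
# Minimal sketch for tracking the Run Time CTQ (placeholder names).
n_obs <- 250000                                      # number of observations in the data set

timing <- system.time(source("model_selection.R"))   # run the whole script and time it
run_time_min <- timing[["elapsed"]] / 60

# Optimal run time from the operational definition above (in minutes)
target_min <- 1 + (n_obs - 100000) * 0.00001

if (run_time_min > target_min) {
  message("Defect: run time of ", round(run_time_min, 2),
          " minutes exceeds the target of ", round(target_min, 2), " minutes.")
} else {
  message("Run time is within target.")
}
```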

Figure 1

V. Using 5S, TPM, Quick Changeovers (SMED) and Mistake Proofing (Poka Yoke) to fix our model selection process.

5S

Seiri: Elimination of unnecessary packages and data sets (waste)
Most programs require multiple statistical analysis tools, which can be found in add-on packages that are installed prior to running code. Many of these are unnecessary, and downloading them may create a lag in the run time or produce errors in the code. This happens because unwanted packages override commands in packages that are needed to perform the analysis, resulting in errors and increased cycle time. Previous data sets already loaded in the program can also result in higher run times, so it is important to identify unnecessary data that is already loaded in the program.

Figure 2 shows the red-tagging process for the first ten packages installed. We can remove the unnecessary packages to reduce our run time.

Figures 3 and 4 show the environment with previous data sets, before and after red tagging.

Figure 3 Figure 4

Figure 2
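
The red-tagging step can also be carried out from the console. The sketch below is a minimal illustration, assuming hypothetical package and object names; it is not the exact code used in the figures.

```r
# Seiri sketch: identify and remove unneeded packages and leftover data sets.
search()                                  # packages currently attached to the session
rownames(installed.packages())            # everything installed on the machine

# Detach an attached package that this analysis does not need (placeholder name)
if ("package:ggmap" %in% search()) {
  detach("package:ggmap", unload = TRUE)
}

# Red-tag data sets left over from earlier sessions (placeholder object names)
ls()                                      # objects currently in the environment
rm(list = intersect(c("old_data", "old_model"), ls()))
```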

Seiton: Keeping a clean coding space
We can organize the code so that it is easy to read in multiple ways:

(1) Using "chunks" to contain code and naming them according to the step they belong to, so that they can easily be viewed and accessed by the user. Figure 4 shows lines of code contained in a "chunk" delimited by the symbols " ``` " and " ``` ". The chunk of code is named "histograms" (its header reads "r histograms", where "r" specifies the language) to indicate that we are at Step 5 of our process.

Figure 4

(2) Using the same format for the entire code. In our program the arrow symbol (<-) and the equals sign (=) are interchangeable for assignment. However, it is much easier to read code when we use a standardized symbol system. In Figure 5, even though all the lines have the same function, lines 4 and 5 have a much cleaner and more organized look than lines 1 and 2. We want all lines to be standardized like lines 4 and 5, so that it is easier to spot mistakes and read the code (a short example is sketched after Figure 5).

Figure 5
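
A brief illustration of the standardized-assignment idea, using a small placeholder data set (the variable names are assumptions, not the data from the figures):

```r
# Placeholder data set standing in for the real one
sales_data <- data.frame(price  = rnorm(100, 50, 5),
                         volume = rnorm(100, 200, 20))

# Inconsistent style: assignment mixes "=" and "<-", which is harder to scan
model_a = lm(price ~ volume, data = sales_data)
fit_a <- fitted(model_a)

# Standardized style: every assignment uses the arrow operator
model_b <- lm(price ~ volume, data = sales_data)
fit_b   <- fitted(model_b)
```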

Seiso: Cleaning our workspace (laptop)
The biggest factor that affects speed is RAM (Random Access Memory), a laptop's short-term memory. When many applications are open on a computer, more of its short-term memory is used, slowing down overall performance. We want the majority of the memory to be focused on the app we are using. To clear memory, in phase 1, we can access the short-term memory through the system manager and click the big X at the top, circled in Figure 6. This reduces one of the opportunities for defects in the run time mentioned above. In phase 3, after cleaning our computer we can proceed to clean the program itself, as shown in Figure 7, by using the broom function circled in red. This eliminates all slow-down from previous build-up in the program and removes the problem at its root cause.

Figure 6

Figure 7
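
The broom button circled in Figure 7 can also be reproduced with two console commands, which is handy when the cleaning step is scripted. A minimal sketch:

```r
# Seiso sketch: clear the program's workspace from the console
rm(list = ls())   # remove every object in the environment (the "broom" step)
gc()              # ask R to release the freed memory back to the operating system
```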

Seiketsu: Developing best practices
The processes above can be automated rather than left as a responsibility for the analyst at the beginning of the process. To clear computer memory, shutting down the computer rather than putting it to sleep clears all of the RAM. Upon shutting down, the program provides the option to either "save" or "don't save" the "workspace image" (seen in Figure 8). By clicking "don't save", one can prevent clutter from building up for future use. If these two best practices are implemented after every use, then the process is standardized.

Figure 8
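
The "don't save" choice can also be made the default so that the analyst is never asked. A hedged sketch of two options: the explicit exit command in base R, and an optional usethis call (an assumption that the package is installed); the same setting can also be changed in the program's global options.

```r
# Seiketsu sketch: make "don't save the workspace image" the standard exit.
q(save = "no")                      # quit the current session without writing an .RData file

# Optionally, set this behaviour once and for all (assumes the usethis package is installed):
# usethis::use_blank_slate("user")  # never save or restore the workspace by default
```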

Shitsuke: Self-discipline
Through the automation of the processes above we can reduce the amount of work that needs to be done by each analyst. The reminders to "shut down" and the "save workspace image" prompt appear automatically when a file is closed or left unattended for a long time, providing an extrinsic reminder to clean the workspace. Hopefully, returning to a clean workspace will provide intrinsic motivation and a reminder at the beginning of each session to keep shutting down and clearing the workspace image going forward.

Total Productive Maintenance

Jishu Hozen

Operators of the code can be more involved with finding the optimal model by learning and understanding what each step means. Through this, coders are better equipped to diagnose errors when they occur. There are many steps in the flowchart, and knowing why one comes after another and how each step relates to the others is very important. By using Jishu Hozen, one can run each individual section out of order to see whether each part works and to learn more about how the parts relate to one another. Then, if one line of code does not work, the analyst will be able to understand why, rather than having to examine the whole process to find the problem.
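
One way to practice this in code is to wrap a single step from the flowchart in tryCatch() and run it on its own, so the operator sees exactly where and why that step fails. The step and data set names below are placeholders:

```r
# Jishu Hozen sketch: run one step of the flowchart in isolation and capture
# its error message instead of letting the whole script stop.
step_5_histograms <- function(df) {
  hist(df$price)                   # placeholder body; any single step works here
}

result <- tryCatch(
  step_5_histograms(sales_data),   # sales_data is a placeholder data set
  error = function(e) {
    message("Step 5 failed: ", conditionMessage(e))
    NULL
  }
)
```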

(1) Breakdown Maintenance

We do not want breakdown maintenance to occur because it will increase the cycle time of the process. However, steps should be identified so that this type of maintenance can be run through quickly. The list below gives the first five reactions one should have to an error in the code; a short sketch after the list shows how some of these checks can be scripted.

1. Check for syntax errors in the code
2. Check for missing packages needed to run the code
3. Check for missing data
4. Check for vector distances
5. Check for previous examples of the error
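
Some of these reactions can be partly scripted so the operator runs through them faster. The sketch below covers reactions 2 and 3; the package and data set names are placeholders.

```r
# Reaction 2: check for missing packages needed to run the code
needed <- c("MASS", "car")                  # placeholder list of required packages
missing_pkgs <- needed[!sapply(needed, requireNamespace, quietly = TRUE)]
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}

# Reaction 3: check for missing data in the working data set
colSums(is.na(sales_data))                  # NA count per variable (placeholder data set)
```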

(2) Preventative Maintenance

The operator of our process should stay up to date with the most common errors found in regression analysis. This can be done by visiting websites such as stackoverflow.com and looking at the most frequent errors. A statistical analysis of the errors found on these programming forums can be carried out to identify them and educate people about them. This will help the operator avoid them, and fix them faster during breakdown maintenance.

(3) Corrective Maintenance

Often there are too many errors compiled in the code and one must start from scratch. One should always clean the workspace first, using the techniques mentioned in the Seiso section. Then one can proceed to remove the most recent line of code, one line at a time, until the program runs again. Sometimes one will return to a blank page and will have to restart the process entirely; this returns the system to an operational condition. Removing one line at a time is more time consuming, but it allows the problem to be identified without eliminating all of the work that has been done.

Quick Changeovers (SMED)

The majority of activities in regression analysis are internal, meaning that they occur while the machine is stopped. This is because computer run time is usually very short, and each individual line of code takes seconds or even milliseconds to run. The only time the machine works for an extended period is in our second-to-last step (number 3 in the table below), when we fit the model with the missing data omitted. Rather than having the operator remain idle during this time, we can have them look deeper into what the missing variables are, overlapping that one minute of analyst work with the machine's run so that both the analyst and the machine are working at the same time (a short sketch of this inspection follows the table).

Task 1: Receiving the data set
Current time: 2 minutes internal, 1 second external
Improvement: Use the 5S to eliminate the need to remove packages before use
Proposed time: 0.5 minutes internal, 1 second external

Task 2: Statistical analysis (each step after 1)
Current time: 3.5 minutes per step internal (30 steps), ~5 seconds external
Improvement: Perform function checks and periodic preventive maintenance
Proposed time: 3.5 minutes per step internal, ~3 seconds external

Task 3: Fitting the model with omitted data
Current time: 1 minute internal, 1 minute external
Improvement: Further understand the missing variables by looking at what they are
Proposed time: 2 minutes internal, 1 minute external

Current total: 108 minutes internal, 3.6 minutes external
Improved total: 107.5 minutes internal, 1.5 minutes external
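
For Task 3, the minute the machine spends fitting the model can be overlapped with the analyst's review of the missing variables. A minimal sketch of what that review might look like, using a placeholder data set name:

```r
# While the long model fit runs, the analyst reviews where the missing
# values actually are instead of waiting idle.
na_per_variable <- colSums(is.na(sales_data))
na_per_variable[na_per_variable > 0]     # only the variables with missing data

# Share of rows affected, to judge whether omitting them is acceptable
mean(!complete.cases(sales_data))
```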

Mistake Proofing (Poka Yoke): Contact Method
We can easily use the contact method for many parts of our process. For example, one can only proceed to a certain step after having completed the previous one; otherwise the program will not compute the calculation. This contact check for errors is built into the machine. For example, if one has not received the data, one cannot continue to check for missing data. Another, more complicated, example is performing certain statistical analyses with outliers: results will not be significant if the outlier is kept, and the analyst will not be able to deliver the final product, a significant predictive model. The built-in mistake proofing in the program reduces the variation that comes from relying on the analyst to make sure that everything is correct.
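
The same contact check can be written explicitly into a script, so that it refuses to move on when a prerequisite step has not produced its output. A minimal sketch with placeholder object names:

```r
# Poka Yoke sketch: stop with a clear message if the previous step's output
# does not exist yet, instead of failing somewhere deeper in the analysis.
if (!exists("sales_data")) {
  stop("Data set has not been received yet; complete Step 1 before continuing.")
}

# Only once the data exists can the next step (checking for missing data) run.
missing_counts <- colSums(is.na(sales_data))
```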

Having a template with checks after the code, or even encouraging copying and pasting from the template, can reduce the number of mistakes made by the analyst. Following the same template allows for an easier flow of the process and reduces overall cycle time. A template can be refined over time as the process manager learns what fits their preferences and works best for them.

VI. Conclusion

We applied different tools to make our process leaner and to improve the three main CTQs: cycle time, number of errors, and run time. Many of these tools are versatile and tackle similar areas of improvement. An important implication I learned is that these tools can be applied universally, to both the internal processes (the analyst) and the external processes (the computer), to achieve the most efficient and simple process.