Data Collection

The datasets used in this research were collected from two previous studies: the Leonard model (1988) [6] and Moselhi et al. (2005) [8]. Combining these two datasets yields 123 data points, which can be considered sufficient for the loss-of-productivity quantification domain. Table 1 shows the distribution of the combined dataset.

Table 1: Distribution of Combined Dataset

| Type of Projects | Number of CO's | Value of Original Contract | Value of Change Orders | Original Estimated Hrs. | Actual Hrs. | CO's Hrs. |
|---|---|---|---|---|---|---|
| Electrical | 37 | $91,984,837 | $42,530,607 | 1,395,330 | 2,324,107 | 447,425 |
| Mechanical | 54 | $168,183,744 | $15,518,911 | 1,815,085 | 2,878,130 | 427,145 |
| Architectural | 5 | $6,410,000 | $914,273 | 95,280 | 128,787 | 17,116 |
| Mech./Elec. | 5 | $30,552,000 | $6,452,000 | 883,430 | 1,190,742 | 143,650 |
| Civil | 22 | $42,538,755 | $9,323,214 | 691,136 | 1,161,878 | 190,958 |
| Grand Total | 123 | $339,669,337 | $74,739,006 | 4,880,263 | 7,683,645 | 1,226,294 |

Research Methodology

The developed nonlinear regression model comprises several steps. The first step is data preprocessing and enhancement; the refined data are then fed into the developed nonlinear regression model. The last step is to compare the results of the developed model against other existing models on a case study and report the outcome. Figure 5 shows a general overview of the developed model.

Figure 5: General Overview of Developed Model

Data Preprocessing and Enhancement

The combined dataset has 14 unique parameters with diverse types and scales, namely: type of impact, type of work, original duration, actual duration, extended duration, original estimated hours, earned hours, actual hours, number of change orders, frequency, change hours, schedule performance index, average size, and percentage of change orders. The values associated with these parameters are not directly comparable, since they lie on different scales. Thus, the alignment process starts by rescaling the large values in the dataset, such as the actual and original estimated hours. The pseudocode for the alignment process is as follows.

Table 2: Pseudocode for Aligning the Given Dataset

    input = dataset
    float ratio = 100
    float aspect_ratio = 1.25
    m, n = input.size()
    for j = 1 to n:
        for i = 1 to m:
            item[i, j] = item[i, j] / max(item[:, j])

In Table 2, the aspect ratio is set to 1.25. This value was obtained through a grid search and depends on the given input dataset; if the input changes, this value should be updated accordingly.
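The alignment in Table 2 amounts to dividing each column by its maximum, scaling every feature into [0, 1]. A minimal NumPy sketch of that step follows; the sample values are rows from Table 1, and the zero-column guard is an added safety check not present in the pseudocode:

```python
import numpy as np

def align_dataset(data: np.ndarray) -> np.ndarray:
    """Scale every column into [0, 1] by dividing by its column maximum."""
    col_max = data.max(axis=0)
    # Guard against all-zero columns to avoid division by zero.
    col_max[col_max == 0] = 1.0
    return data / col_max

# Example: original estimated hours and number of CO's on very different scales.
raw = np.array([[1395330.0, 37.0],
                [1815085.0, 54.0],
                [95280.0, 5.0]])
aligned = align_dataset(raw)
```

After alignment, every column's maximum is exactly 1, so features measured in millions of hours and features counted in tens become directly comparable.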

As a second step in enhancing the dataset, an augmented apriori-like algorithm is used to maximize the margin around the features, especially those that are very close to each other in value. The algorithm first finds the local and global extremum values for scaling the records up. It then assumes that arrows are drawn from the origin to the records with respect to these extremums. The extremums act as knots that nonlinearly bias the arrows; in other words, the values are mapped to another space in which all the records are represented by arrows and knots. Finding the maximum margins between these arrows then becomes an easier task, accomplished by solving the Jacobian matrix. After these computations, hanger values are generated whose tensor product with the original records maximizes their Cartesian distance, which in turn helps the regression algorithm tune its parameters. Specifically, for records with percentage values, such as the extended duration feature, the corresponding centroid is computed over all the values along that feature. For this purpose, a Gaussian distribution approximation is used to find the best statistical expectation (ideally set to zero) and the proper standard deviation (SD); the final values reached are a mean of 0.25 and an SD of 1.24. If each row of the dataset is assumed to be a 14-D vector in a non-Cartesian space, its basis vectors can be found using an algebraic factorization such as Cholesky. The rank of this factorization gives the degree of the nonlinear six-degree-of-freedom (6-DOF) system to be solved by the Jacobian. Finally, each row that is not consistent can be replaced by the mean of the rows via the approximate 6-DOF polynomial. For the current dataset, this technique is applied to tuples 64, 57, 87, and 110.
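Two of the operations above can be sketched concretely: re-centering a percentage feature to the stated mean of 0.25 and SD of 1.24, and replacing the flagged rows (57, 64, 87, and 110 in the text). Note that replacing a flagged row with the column-wise mean, as done below, is a deliberate simplification of the 6-DOF polynomial approximation described above, not the paper's exact procedure:

```python
import numpy as np

def recenter_feature(col: np.ndarray,
                     target_mean: float = 0.25,
                     target_sd: float = 1.24) -> np.ndarray:
    """Standardize a feature column, then shift/scale to the target mean and SD."""
    z = (col - col.mean()) / col.std()
    return z * target_sd + target_mean

def replace_inconsistent_rows(data: np.ndarray, rows: list) -> np.ndarray:
    """Replace each flagged row with the column-wise mean of all rows."""
    out = data.copy()
    out[rows] = data.mean(axis=0)
    return out
```

Applying `recenter_feature` to, say, the extended-duration column yields a column whose sample mean and SD match the values reported in the text, regardless of the column's original scale.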

Nonlinear Regression

There are several ways to find a polynomial curve that represents the data as smoothly as possible. Although linear regression is a fast and accurate method for a balanced and normalized dataset, such as the one created in the previous section, its performance varies from dataset to dataset. The following simple rule applies to the processed dataset:

Equation 1:

$$h_\theta(x) = \theta_0 + \theta_1 x$$

where $h_\theta(x)$ denotes the hypothesized line that we would like to achieve and $x$ is the given input. Based on the achieved results, the RMSE associated with this algorithm was about 21.32%, which is quite high. The next step after linear regression was its nonlinear counterpart. The common approach for handling nonlinear regression is to approximate it with a piecewise linear function: since the fitted function is no longer a single line, the nonlinearity is implemented by several linear segments. In our implementation, this approach yields an RMSE of 17.34%.
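The piecewise linear idea can be illustrated on synthetic one-dimensional data. The segment count of 4 and the quadratic target function below are illustrative choices, not taken from this study; the point is only that the piecewise fit drives the RMSE below that of a single line, mirroring the 21.32% to 17.34% improvement reported above:

```python
import numpy as np

def piecewise_linear_fit(x: np.ndarray, y: np.ndarray, n_segments: int = 4) -> np.ndarray:
    """Fit an independent least-squares line on each of n_segments equal-width
    intervals of x and return the piecewise prediction for every point."""
    edges = np.linspace(x.min(), x.max(), n_segments + 1)
    y_hat = np.empty_like(y)
    for k in range(n_segments):
        # The last segment includes its right edge so every point is covered.
        if k == n_segments - 1:
            mask = (x >= edges[k]) & (x <= edges[k + 1])
        else:
            mask = (x >= edges[k]) & (x < edges[k + 1])
        slope, intercept = np.polyfit(x[mask], y[mask], deg=1)
        y_hat[mask] = slope * x[mask] + intercept
    return y_hat

# A nonlinear target that a single line fits poorly.
x = np.linspace(0.0, 1.0, 200)
y = x ** 2
rmse_linear = np.sqrt(np.mean((np.polyval(np.polyfit(x, y, 1), x) - y) ** 2))
rmse_piecewise = np.sqrt(np.mean((piecewise_linear_fit(x, y) - y) ** 2))
```

Each segment is fit independently here; a production implementation would typically also enforce continuity at the segment boundaries.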

The nonlinear regression can also be articulated by formulating the model directly with a set of nonlinear functions. First, the dataset is partitioned into seven 2×2 patches (two features per patch, covering the 14 features of the dataset), and a nonlinear sigmoid-like function is assigned to each of them. Formula (2) depicts this nonlinear function.
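As one common sigmoid-like choice, the logistic function below illustrates the kind of per-patch nonlinearity described; its use here is an assumption for illustration, not necessarily the exact form of Formula (2):

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    """Logistic sigmoid: maps any real input smoothly into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def apply_patchwise(data: np.ndarray, patch: int = 2) -> np.ndarray:
    """Apply the sigmoid to each patch-wide block of feature columns,
    mirroring the idea of one nonlinearity per patch of features."""
    out = data.copy()
    for start in range(0, data.shape[1], patch):
        out[:, start:start + patch] = sigmoid(data[:, start:start + patch])
    return out
```

With 14 feature columns and `patch=2`, the loop runs seven times, matching the seven patches described above.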