COVID-19 Data Modelling
This the modelling assignment. The thing that will be modeled is the COVID-19 growth in Indonesia. Hence, the date in the future can be predicted during the pandemic by utilizing the basic model which is found through this assignment.
The data is based on the following URL:
https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide |
The country which is chosen to do the modelling is Indonesia. Such data contains about 101 rows. However, as the existing cases starts from 3/2/2020, the number of rows which is used to do the modelling is only 37 rows.
Data Exploration
while exploring the data, I found there is the non-sequence data. Therefore, I sort the data based on the sequence date in order to obtain the better growth. The following code is the data exploration code for Indonesia case.
#read directory os.chdir('D:/George Mason University/Semester 2/Analytics Big Data to Information/Assignment 8') os.getcwd() #Import the data1 data1 = pd.read_csv('COVID-19-geographic-disbtribution-worldwide-2020-04-16.csv', sep = ',') data1 data1.head(10) data1.info() data1['countriesAndTerritories'] #setdata1 Indonesia data1_indonesia = data1.query('countriesAndTerritories == "Indonesia"') len(data1_indonesia.index) data1_indonesia2 = data1_indonesia.sort_values(by='dateRep') data1_indonesia2.info() data1_indonesia3 = data1_indonesia2[['dateRep', 'day', 'month', 'year', 'cases', 'deaths', 'popData2018', 'countriesAndTerritories']] data1_indonesia3.head(50) #convert to array for sorting dateRep1 = np.array(data1_indonesia3['dateRep']) day1 = np.array(data1_indonesia3['day']) month1 = np.array(data1_indonesia3['month']) year1 = np.array(data1_indonesia3['year']) cases1 = np.array(data1_indonesia3['cases']) deaths1 = np.array(data1_indonesia3['deaths']) popData1 = np.array(data1_indonesia3['popData2018']) countriesAndTerritories1 = np.array(data1_indonesia3['countriesAndTerritories']) ############################################################################### #sort based on year ############################################################################### i = 0 while (i < (len(data1_indonesia3) - 1)): j = 0 while (j < (len(data1_indonesia3)-i-1)): if (year1[j] > year1[j + 1]): temp1 = year1[j] year1[j] = year1[j + 1] year1[j + 1] = temp1 temp2 = day1[j] day1[j] = day1[j + 1] day1[j + 1] = temp2 temp3 = month1[j] month1[j] = month1[j + 1] month1[j + 1] = temp3 temp4 = cases1[j] cases1[j] = cases1[j + 1] cases1[j + 1] = temp4 temp5 = deaths1[j] deaths1[j] = deaths1[j + 1] deaths1[j + 1] = temp5 temp6 = dateRep1[j] dateRep1[j] = dateRep1[j + 1] dateRep1[j + 1] = temp6 temp7 = popData1[j] popData1[j] = popData1[j + 1] popData1[j + 1] = temp7 j = j + 1 i = i + 1 ############################################################################### #sort based on month ############################################################################### i = 0 while (i < (len(data1_indonesia3) - 1)): if (year1[i] == year1[i + 1]): print("same year") j = i; while (j < (len(data1_indonesia3) - 1)): k = j while (k < (len(data1_indonesia3)-i-1)): print('sorting process') if (month1[k] > month1[k + 1]): temp1 = year1[k] year1[k] = year1[k + 1] year1[k + 1] = temp1 temp2 = day1[k] day1[k] = day1[k + 1] day1[k + 1] = temp2 temp3 = month1[k] month1[k] = month1[k + 1] month1[k + 1] = temp3 temp4 = cases1[k] cases1[k] = cases1[k + 1] cases1[k + 1] = temp4 temp5 = deaths1[k] deaths1[k] = deaths1[k + 1] deaths1[k + 1] = temp5 temp6 = dateRep1[k] dateRep1[k] = dateRep1[k + 1] dateRep1[k + 1] = temp6 temp7 = popData1[k] popData1[k] = popData1[k + 1] popData1[k + 1] = temp7 k = k + 1 j = j + 1 else: print("different year") i = i + 1 ############################################################################### #sort based on day ############################################################################### i = 0 while (i < (len(data1_indonesia3) - 1)): if (year1[i] == year1[i + 1]): print("same year") j = i; while (j < (len(data1_indonesia3) - 1)): k = j while (k < (len(data1_indonesia3)-i-1)): if (month1[k] == month1[k + 1]): print("same month") l = k while ((l < (len(data1_indonesia3)-i)) and (month1[l] == month1[l + 1])): if (day1[l] > day1[l + 1]): temp1 = year1[l] year1[l] = year1[l + 1] year1[l + 1] = temp1 temp2 = day1[l] day1[l] = day1[l + 1] day1[l + 1] = temp2 temp3 = month1[l] month1[l] = month1[l + 1] month1[l + 1] = temp3 temp4 = cases1[l] cases1[l] = cases1[l + 1] cases1[l + 1] = temp4 temp5 = deaths1[l] deaths1[l] = deaths1[l + 1] deaths1[l + 1] = temp5 temp6 = dateRep1[l] dateRep1[l] = dateRep1[l + 1] dateRep1[l + 1] = temp6 temp7 = popData1[l] popData1[l] = popData1[l + 1] popData1[l + 1] = temp7 l = l + 1 else: print("different month") k = k + 1 j = j + 1 else: print("different year") i = i + 1 ############################################################################### #New Data Frame for setdata1 Indonesia ############################################################################### data1_indonesia4 = pd.DataFrame({'dateRep' : dateRep1, 'day' : day1, 'month' : month1, 'year' : year1, 'cases' : cases1, 'deaths' : deaths1, 'popData2018' : popData1, 'countriesAndTerritories' : countriesAndTerritories1}) data1_indonesia4 = data1_indonesia4.query('cases > 0').reset_index(drop=False).reset_index(drop=False) #data1_indonesia4 = data1_indonesia4[['level_0', 'dateRep', 'day', 'month', 'year', 'cases', 'deaths', 'popData2018']] data1_indonesia4 = data1_indonesia4[['level_0', 'dateRep', 'day', 'month', 'year', 'cases']] #data1_indonesia4.columns = ['Timestep', 'dateRep', 'day', 'month', 'year', 'cases', 'deaths', 'popData2018'] data1_indonesia4.columns = ['Timestep', 'dateRep', 'day', 'month', 'year', 'cases'] data1_indonesia4.head(16) |
create a basic model of the growth of the virus in that country
prior to decide the best basic model, the summary of the model should be performed at first in order to check the R-Square value. In this case, since the R-Square value has the best value which almost approach to 1, and already been proved by analyzing the diagnostic plot, so the model seems to be the best basic model. The model summary is shown by the following summary:
OLS Regression Results ============================================================================== Dep. Variable: cases R-squared: 0.857 Model: OLS Adj. R-squared: 0.853 Method: Least Squares F-statistic: 215.0 Date: Sun, 19 Apr 2020 Prob (F-statistic): 9.38e-17 Time: 00:21:18 Log-Likelihood: -194.53 No. Observations: 38 AIC: 393.1 Df Residuals: 36 BIC: 396.3 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const -31.6437 13.224 -2.393 0.022 -58.464 -4.823 Timestep 9.0163 0.615 14.662 0.000 7.769 10.263 ============================================================================== Omnibus: 7.199 Durbin-Watson: 1.538 Prob(Omnibus): 0.027 Jarque-Bera (JB): 6.598 Skew: 0.645 Prob(JB): 0.0369 Kurtosis: 4.582 Cond. No. 42.2 ============================================================================== |
The following code is used to perform the summary above:
X = data1_indonesia4.Timestep X = sm.add_constant(X) y1 = data1_indonesia4.cases model1 = sm.OLS(y1, X) model_fit1 = model1.fit() print(model_fit1.summary()) |
The diagnostic plot is performed while analyzing the model. The diagnostic plot is shown through the following picture.
According to the four diagnotic plot above, it can be concluded that the basic model is the best fit for the data. The first plot, which is the Residuals vs Fitted, shows that the model does not violate the linear assumtions since the red line still adjacent to the dash line. Also, the spread residuals still surrounds the red line without many distinct patterns of the red line.
The second plot shows a little violations to the normality assumptions since there are some residuals with the extreme values than would be expected. However, even though the extreme data are omited, that would not make a lot of change since all the data are beyond the cook’s distance as shown through plot number 4, the residuals vs leverage.
The third plot shows that the red line does not perfectly appear to be horizontal. However, as long as the red line is not the same as y = 0, it does not violate the equal variance.
The basic model of the residuals is y = (-31.6437) + 9.0163X. The declaration code of such model is shown through the following:
#basic model function = #y = (-31.6437) + 9.0163X def linear_predictions(t): return (-31.6437) + (9.0163 * t) |
The actual data and the predicted data shown by the following:
Timestep dateRep day month year cases PredictedCases 0 0 3/2/2020 2 3 2020 2 -31.6437 1 1 3/7/2020 7 3 2020 2 -22.6274 2 2 3/9/2020 9 3 2020 2 -13.6111 3 3 3/11/2020 11 3 2020 13 -4.5948 4 4 3/12/2020 12 3 2020 15 4.4215 5 5 3/14/2020 14 3 2020 35 13.4378 6 6 3/15/2020 15 3 2020 27 22.4541 7 7 3/16/2020 16 3 2020 21 31.4704 8 8 3/17/2020 17 3 2020 17 40.4867 9 9 3/18/2020 18 3 2020 38 49.5030 10 10 3/20/2020 20 3 2020 55 58.5193 11 11 3/21/2020 21 3 2020 82 67.5356 12 12 3/22/2020 22 3 2020 141 76.5519 13 13 3/23/2020 23 3 2020 64 85.5682 14 14 3/24/2020 24 3 2020 65 94.5845 15 15 3/25/2020 25 3 2020 107 103.6008 16 16 3/26/2020 26 3 2020 104 112.6171 17 17 3/27/2020 27 3 2020 103 121.6334 18 18 3/28/2020 28 3 2020 153 130.6497 19 19 3/29/2020 29 3 2020 109 139.6660 20 20 3/30/2020 30 3 2020 130 148.6823 21 21 3/31/2020 31 3 2020 129 157.6986 22 22 4/1/2020 1 4 2020 114 166.7149 23 23 4/2/2020 2 4 2020 149 175.7312 24 24 4/3/2020 3 4 2020 113 184.7475 25 25 4/4/2020 4 4 2020 196 193.7638 26 26 4/5/2020 5 4 2020 106 202.7801 27 27 4/6/2020 6 4 2020 181 211.7964 28 28 4/7/2020 7 4 2020 218 220.8127 29 29 4/8/2020 8 4 2020 247 229.8290 30 30 4/9/2020 9 4 2020 218 238.8453 31 31 4/10/2020 10 4 2020 337 247.8616 32 32 4/11/2020 11 4 2020 219 256.8779 33 33 4/12/2020 12 4 2020 330 265.8942 34 34 4/13/2020 13 4 2020 399 274.9105 35 35 4/14/2020 14 4 2020 316 283.9268 36 36 4/15/2020 15 4 2020 282 292.9431 37 37 4/16/2020 16 4 2020 297 301.9594 |
Other than that, the illustration of the basic model is such the follow:
The potential limitations of the model are the extreme spike of the data can’t be predicted. Other than that, as shown in the second plot of the diagnostic plot, the data model does not perfectly match the residuals since the head and the tail does not perfectly match the diagonal line.