Context: we analyze economic changes in Colombian municipalities between 1993 and 2020.

We need to download ArcGIS, a geographic information system (GIS) for working with maps and geographic information, maintained by the Environmental Systems Research Institute (Esri). It is used for creating and using maps, compiling geographic data, analyzing mapped information, sharing and discovering geographic information, using maps and geographic information in a range of applications, and managing geographic information in a database. Particularly important for the purpose of this post is ArcMap, one of the ArcGIS applications, which is primarily used to view, edit, create, and analyze geospatial data. If you do not have a license for this software, you can use free software such as gvSIG.

In this post, we focus on how to download our “raw data”, i.e. the nighttime light data for Colombia.

Step 1: Download the nighttime light series from the NOAA National Centers for Environmental Information (NCEI). Two products are relevant. The first is the Version 4 DMSP-OLS Nighttime Lights Time Series (DMSP): its annual composites contain average radiance values of cloud-free coverages, reflecting the persistent lights from cities, villages, and roads, with a spatial resolution of about 900m and a temporal coverage of 1992 to 2013. The second is the VIIRS data, which is available from 2013 on and has a finer spatial resolution of approximately 450m. We will cover the particularities of VIIRS in a later post; here we focus on DMSP.
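Before downloading, it can help to list which DMSP annual composites exist, since some years are covered by more than one satellite. The sketch below is only a planning aid: the satellite-to-year ranges and the `F<satellite><year>` labels (for example `F101992`) are assumptions based on how the Version 4 series is commonly described, not something stated in this post, and no download URL is assumed. Check them against the NCEI listing before relying on them.

```python
# Assumed satellite coverage of the Version 4 DMSP-OLS annual composites.
# These ranges are an assumption; verify them against the NCEI listing.
DMSP_COVERAGE = {
    "F10": (1992, 1994),
    "F12": (1994, 1999),
    "F14": (1997, 2003),
    "F15": (2000, 2007),
    "F16": (2004, 2009),
    "F18": (2010, 2013),
}


def dmsp_composites(coverage=DMSP_COVERAGE):
    """Return (label, satellite, year) for every satellite-year composite."""
    composites = []
    for satellite, (first_year, last_year) in coverage.items():
        for year in range(first_year, last_year + 1):
            # Label format is an assumption, e.g. "F101992".
            composites.append((f"{satellite}{year}", satellite, year))
    return composites


def satellites_by_year(composites):
    """Group satellites by year; overlapping years list more than one."""
    by_year = {}
    for _label, satellite, year in composites:
        by_year.setdefault(year, []).append(satellite)
    return by_year


composites = dmsp_composites()
by_year = satellites_by_year(composites)
for year in sorted(by_year):
    print(year, ", ".join(by_year[year]))
```

Years that appear with two satellites will need a rule, such as averaging the two composites or keeping one sensor; that choice is left open here.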

Figure 1. The green line represents an overfitted model and the black line represents a regularized model. While the green line best follows the training data, it is too dependent on that data and it is likely to have a higher error rate on new unseen data, compared to the black line.

Figure 2. Noisy (roughly linear) data is fitted to a linear function and a polynomial function. Although the polynomial function is a perfect fit, the linear function can be expected to generalize better: if the two functions were used to extrapolate beyond the fit data, the linear function would make better predictions.

In statistics, overfitting is “the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably”.^{[1]} An overfitted model is a statistical model that contains more parameters than can be justified by the data.^{[2]} The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e. the noise) as if that variation represented underlying model structure.^{[3]}^{:45}

Underfitting occurs when a statistical model cannot adequately capture the underlying structure of the data. An underfitted model is a model where some parameters or terms that would appear in a correctly specified model are missing.^{[2]} Underfitting would occur, for example, when fitting a linear model to non-linear data. Such a model will tend to have poor predictive performance.
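The linear-model example can be sketched with made-up data. The noiseless parabola on a symmetric grid below is chosen only for illustration, and `fit_line` and `r_squared` are hypothetical helper names. On this data the best-fitting straight line is essentially flat and explains almost none of the variation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(ys, predictions):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Illustrative non-linear data: y = x^2 on a grid symmetric around zero.
xs = [-1.0 + 0.1 * i for i in range(21)]
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
predictions = [intercept + slope * x for x in xs]
r2 = r_squared(ys, predictions)
print(f"intercept={intercept:.3f} slope={slope:.3f} R^2={r2:.3f}")
```

The line is missing the quadratic term that a correctly specified model would contain, so no choice of intercept and slope can follow the curvature.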

Overfitting and underfitting also occur in machine learning, where the phenomena are sometimes called “overtraining” and “undertraining”.

The possibility of overfitting exists because the criterion used for selecting the model is not the same as the criterion used to judge the suitability of a model. For example, a model might be selected by maximizing its performance on some set of training data, and yet its suitability might be determined by its ability to perform well on unseen data; then overfitting occurs when a model begins to “memorize” training data rather than “learning” to generalize from a trend.

As an extreme example, if the number of parameters is the same as or greater than the number of observations, then a model can perfectly predict the training data simply by memorizing the data in its entirety. (For an illustration, see Figure 2.) Such a model, though, will typically fail severely when making predictions.
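A minimal sketch of this extreme case, under assumed data: eight noisy points from a straight line (the relation `y = 1 + 2x` and the noise level are illustrative choices). A two-parameter line is compared with a degree-7 polynomial, which has as many coefficients as there are observations. `fit_line` and `fit_interpolant` are hypothetical helper names.

```python
import random


def fit_line(xs, ys):
    """Least-squares line (two parameters); returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_interpolant(xs, ys):
    """Lagrange polynomial of degree n-1 passing through all n points."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared prediction error of a model on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def truth(x):
    return 1.0 + 2.0 * x  # assumed underlying relation


rng = random.Random(0)
train_x = [i / 7 for i in range(8)]
train_y = [truth(x) + rng.gauss(0.0, 0.2) for x in train_x]
# New observations just beyond the range used for fitting.
test_x = [1.0 + i / 10 for i in range(1, 6)]
test_y = [truth(x) + rng.gauss(0.0, 0.2) for x in test_x]

line = fit_line(train_x, train_y)
poly = fit_interpolant(train_x, train_y)
print("train MSE  line:", mse(line, train_x, train_y),
      " polynomial:", mse(poly, train_x, train_y))
print("test MSE   line:", mse(line, test_x, test_y),
      " polynomial:", mse(poly, test_x, test_y))
```

The polynomial reproduces the training points exactly because it has memorized the noise along with the trend, and that memorized noise is what degrades its predictions on the new points.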

The potential for overfitting depends not only on the number of parameters and the amount of data but also on the conformability of the model structure with the data shape, and on the magnitude of model error compared to the expected level of noise or error in the data. Even when the fitted model does not have an excessive number of parameters, it is to be expected that the fitted relationship will appear to perform less well on a new data set than on the data set used for fitting (a phenomenon sometimes known as shrinkage).^{[2]} In particular, the value of the coefficient of determination will shrink relative to the original data.
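Shrinkage can be illustrated with a small simulation. The sketch below assumes a simple linear data-generating process (the slope, noise level, and sample size are illustrative), fits a line to one sample, and then scores that same line on a fresh sample drawn at the same x values. The coefficient of determination on the new sample is computed against the new sample's own mean, which is one common convention and an assumption here.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(xs, ys, intercept, slope):
    """1 - SS_residual / SS_total for a given line on (xs, ys)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def simulate(n_reps=500, n_obs=10, noise_sd=1.0, seed=0):
    """Average R^2 on the fitting sample and on a fresh sample."""
    rng = random.Random(seed)
    xs = [i / (n_obs - 1) for i in range(n_obs)]

    def draw():
        return [1.0 + 2.0 * x + rng.gauss(0.0, noise_sd) for x in xs]

    fit_total = 0.0
    new_total = 0.0
    for _ in range(n_reps):
        fit_ys = draw()   # sample used to estimate the line
        new_ys = draw()   # independent sample from the same process
        intercept, slope = fit_line(xs, fit_ys)
        fit_total += r_squared(xs, fit_ys, intercept, slope)
        new_total += r_squared(xs, new_ys, intercept, slope)
    return fit_total / n_reps, new_total / n_reps


avg_fit_r2, avg_new_r2 = simulate()
print(f"average R^2 on fitting data: {avg_fit_r2:.3f}")
print(f"average R^2 on new data:     {avg_new_r2:.3f}")
```

Any single pair of samples can go either way; the gap shows up on average, which is why the comparison is made over many replications.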

To lessen the chance of, or amount of, overfitting, several techniques are available (e.g. model comparison, cross-validation, regularization, early stopping, pruning, Bayesian priors, or dropout). The basis of some techniques is either (1) to explicitly penalize overly complex models or (2) to test the model’s ability to generalize by evaluating its performance on a set of data not used for training, which is assumed to approximate the typical unseen data that a model will encounter.
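As one illustration of approach (2), the sketch below uses k-fold cross-validation to compare three candidate models on assumed data (noisy points from `y = 1 + 2x`): a constant, a straight line, and a polynomial that interpolates every training point. The helper names, the interleaved fold assignment, and the data are all illustrative choices.

```python
import random


def fit_mean(xs, ys):
    """Constant model: always predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Least-squares line; returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_interpolant(xs, ys):
    """Lagrange polynomial through every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict


def k_fold_mse(fit, xs, ys, k):
    """Mean squared error on held-out folds (observation i is in fold i % k)."""
    total = 0.0
    for fold in range(k):
        train_x = [x for i, x in enumerate(xs) if i % k != fold]
        train_y = [y for i, y in enumerate(ys) if i % k != fold]
        model = fit(train_x, train_y)
        total += sum((model(x) - y) ** 2
                     for i, (x, y) in enumerate(zip(xs, ys)) if i % k == fold)
    return total / len(xs)


rng = random.Random(1)
xs = [i / 11 for i in range(12)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.2) for x in xs]

candidates = {"mean only": fit_mean, "line": fit_line,
              "interpolant": fit_interpolant}
scores = {name: k_fold_mse(fit, xs, ys, k=4) for name, fit in candidates.items()}
best = min(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name:12s} held-out MSE = {score:.4f}")
print("selected:", best)
```

Because every model is scored only on observations it was not fitted to, a model that memorizes the training points gains nothing from doing so, and the constant model is penalized for missing the trend.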