Spatial Patterns of Demographics and Assets in Sierra Leone
Isaiah Lyons-Galante, Caleb Schmitz
GEOG 5023 Advanced Quantitative Methods in Geography
Final Project Report Spring 2023
1. Introduction
Sierra Leone is a small country in western Africa with a population of just over 8 million. Despite recent political stability and economic growth, Sierra Leone has the fifth-lowest Gross Domestic Product per capita (World Bank, 2021). A lack of infrastructure and of access to essential services such as health care, education, and electricity is often cited as a key contributor to this persistent wealth gap (United Nations, n.d.). To help combat this, development projects have focused on infrastructure such as rural electrification (UNOPS, 2018). To be effective, these projects need to be targeted at the areas most in need.
However, reliable data on economic livelihoods and electricity access remain scarce in the developing world, hampering efforts to study these outcomes and to design policies that improve them (Jean et al. 2016). In particular, understanding the baseline socioeconomic status of Sierra Leone is necessary to quantify the impact infrastructure projects could have on poverty reduction. Decision-makers need this information to decide where to allocate scarce public resources.
This research addresses these challenges in three ways. First, we describe how household wealth and electricity access vary spatially across Sierra Leone. Next, we develop models to predict the wealth of a household based on its demographics, assets, construction, and geographic location, and examine how the relationships between the predictors and the target vary spatially across the country. Finally, we create a model to spatially interpolate household wealth, so that wealth can still be estimated in regions without adequate data, which should ease project implementation.
2. Data and Methods
2.1 Datasets
The United States Agency for International Development (USAID) manages a program called the Demographic and Health Surveys (DHS) that surveys dozens of developing countries every few years on demographics, assets, and public health outcomes (DHS, n.d. a.). Households are surveyed in clusters the size of a few city blocks in urban areas, or a village in rural areas. The geographical coordinates of the cluster are recorded, and then scattered to a random nearby point to protect the privacy of the respondents (USAID, n.d.).
The most recent survey from Sierra Leone is from 2019, and is made publicly available upon request. The survey covers 13,399 households in 576 clusters across every region of the country. Each household in the survey is tagged to a cluster, and has key information such as the number of members in the household, ownership of electronics, housing construction materials, and access to electricity. In total, there are about 30 variables that are relevant to demographics, assets, and wealth.
The data can be obtained from the DHS either by API query or by downloading a zip file from the website. With 13,399 rows and 3,455 columns, the full dataset is about 60 megabytes. The spatial coordinates need to be downloaded separately and then joined to the survey results by cluster identification number. The final datasets used in the analysis are the shapefiles for the country and regions of Sierra Leone, also available from the DHS (DHS, n.d. b).
2.2 Data Cleaning (Lyons-Galante, Schmitz)
After gathering the data, several steps were needed to prepare for the analysis. The first step was to add a spatial component to the data. The cluster coordinates were left-joined to the survey responses based on the cluster identification number. The next step was to reduce the size of the dataset by dropping all columns related to public health and other questions not addressed in this analysis. This brought the dataset down from 3,475 columns to just 64 columns, a 98% reduction in size. A full list of the columns used for this analysis can be found in the Appendix.
The next key step for data cleaning was to find all households with invalid coordinates. Of the 576 spatial clusters, 19 had coordinates of “0,0”. After the join with the household dataset, this corresponded to 436 households with invalid coordinates, which were filtered from the dataset.
The next key piece of data cleaning was to eliminate responses to questions that were “unknown”, “NA”, or “missing”. Doing this involved converting the data to the appropriate types such as numeric, integer, or factor, and then filtering out the invalid responses.
The final piece of data preparation was to aggregate the households into clusters and calculate the aggregate statistics. For this, the data was grouped by the latitude and longitude columns to preserve the coordinates, and the remaining variables were aggregated into their mean value per cluster.
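The cleaning steps described in this section can be sketched as follows. This is a minimal illustration in plain Python, not the code used for the analysis; the records, field names, and values are hypothetical and do not correspond to the DHS column names.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records; field names are illustrative, not DHS column names.
clusters = {
    1: (8.48, -13.23),
    2: (0.0, 0.0),  # invalid placeholder coordinates
    3: (7.96, -11.74),
}
households = [
    {"cluster": 1, "wealth": 120000.0, "electricity": 1},
    {"cluster": 1, "wealth": 80000.0, "electricity": 1},
    {"cluster": 2, "wealth": -50000.0, "electricity": 0},
    {"cluster": 3, "wealth": -90000.0, "electricity": 0},
    {"cluster": 3, "wealth": -30000.0, "electricity": None},  # missing response
]

def clean_and_aggregate(households, clusters):
    """Join coordinates, drop invalid rows, and average each variable per cluster."""
    grouped = defaultdict(list)
    for row in households:
        # Left join on cluster id; unmatched clusters are treated as invalid.
        lat, lon = clusters.get(row["cluster"], (0.0, 0.0))
        if (lat, lon) == (0.0, 0.0):
            continue  # drop households whose cluster has no valid location
        if any(value is None for value in row.values()):
            continue  # drop households with missing responses
        grouped[(lat, lon)].append(row)
    return {
        coords: {
            "wealth": mean(r["wealth"] for r in rows),
            "electricity": mean(r["electricity"] for r in rows),
            "n_households": len(rows),
        }
        for coords, rows in grouped.items()
    }

cluster_means = clean_and_aggregate(households, clusters)
for coords, stats in cluster_means.items():
    print(coords, stats)
```

Grouping by the coordinate pair rather than the cluster identifier mirrors the approach described above, since it keeps the location attached to each aggregated record.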
2.3 Mapping (Schmitz)
The first analysis maps the clusters across Sierra Leone, overlaid with the national and regional boundaries. The two variables mapped are electricity access and mean household wealth index. To better visualize the areas, the cluster points are converted to area data using a Voronoi tessellation, also known as Thiessen polygons. Thiessen polygons tessellate a plane containing a set of points such that every location within a polygon is closer to that polygon's point than to any other point (Longley, 2005). This effectively converts the point data into areas for better visualization and a more straightforward definition of neighbors.
Figure 1: Thiessen polygons surround each of the 500+ survey clusters in Sierra Leone. These polygons allow us to convert point data into area data.
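The defining property of Thiessen polygons, that each location belongs to its nearest survey point, can be illustrated with a small grid. The sketch below is not how the polygons in Figure 1 were built; the site coordinates are hypothetical and distances are planar, so real latitude and longitude values would need to be projected first.

```python
import math

def nearest_site(point, sites):
    """Return the index of the site closest to `point` (Euclidean distance)."""
    return min(range(len(sites)), key=lambda i: math.dist(point, sites[i]))

def rasterize_thiessen(sites, width, height):
    """Label each cell of a width x height grid with its nearest site index."""
    return [
        [nearest_site((x + 0.5, y + 0.5), sites) for x in range(width)]
        for y in range(height)
    ]

# Hypothetical cluster locations in arbitrary planar units.
sites = [(1.0, 1.0), (5.0, 1.0), (3.0, 5.0)]
grid = rasterize_thiessen(sites, width=6, height=6)
for row in reversed(grid):  # print with y increasing upward
    print("".join(str(label) for label in row))
```

Cells that share a label approximate one Thiessen polygon, and two sites are neighbors when their labelled regions touch.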
2.4 Regression (Lyons-Galante)
The next phase of the analysis focuses on predicting household wealth index using regression, both spatial and non-spatial. The first model is a null model that assumes the mean value of the wealth index for all households. The root mean squared error (RMSE) of this model serves as a baseline for comparison. The next model introduces independent variables such as asset ownership and demographics to predict wealth, but does not yet include a spatial component. Lastly, spatially aware regression is performed using both a spatial lag model and a spatial error model. The performance of these models is compared with the non-spatial model, and with each other, to determine the optimal model for the data.
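The comparison between the null model and a regression model can be sketched as below. This is a simplified illustration with a single hypothetical predictor and made-up values; the analysis itself uses many predictors, and the spatial lag and spatial error models are not reproduced here.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

def fit_simple_ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
        (xi - x_bar) ** 2 for xi in x
    )
    return y_bar - slope * x_bar, slope

# Hypothetical cluster-level values: share of households with electricity
# and mean wealth index.
electricity = [0.0, 0.1, 0.2, 0.6, 0.9, 1.0]
wealth = [-80.0, -60.0, -50.0, 20.0, 90.0, 110.0]

# Null model: predict the overall mean everywhere.
null_prediction = [sum(wealth) / len(wealth)] * len(wealth)

# Non-spatial linear model with one predictor.
intercept, slope = fit_simple_ols(electricity, wealth)
ols_prediction = [intercept + slope * e for e in electricity]

print(f"null RMSE: {rmse(wealth, null_prediction):.1f}")
print(f"OLS RMSE:  {rmse(wealth, ols_prediction):.1f}")
```

Because the null model predicts the mean, its RMSE equals the population standard deviation of the target, which makes it a natural baseline for every later model.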
2.5 Kriging (Schmitz)
The next analysis performed is the prediction of wealth index through kriging. By creating a variogram of the data and then fitting a curve to it, we are able to build a continuous map of wealth index across Sierra Leone. This provides estimates of the wealth index in regions that were not directly surveyed. The sill, nugget, and range of the variogram help us quantify the extent to which space affects the economic status of a household. The performance of this model compared with the null model also tells us about the spatial nature of wealth distribution.
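The empirical variogram underlying this step is half the mean squared difference between pairs of observations, grouped by the distance separating them. The sketch below shows that calculation on a hypothetical four-point transect; the bin width and the data are assumptions for illustration, and the analysis itself would rely on a geostatistics package rather than this code.

```python
import math
from itertools import combinations

def empirical_semivariogram(points, values, bin_width, n_bins):
    """Return [(bin_center, semivariance, n_pairs)] for equal-width distance bins."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i, j in combinations(range(len(points)), 2):
        b = int(math.dist(points[i], points[j]) // bin_width)
        if b < n_bins:
            sums[b] += (values[i] - values[j]) ** 2
            counts[b] += 1
    # Semivariance is half the mean squared difference within each bin.
    return [
        ((b + 0.5) * bin_width, sums[b] / (2 * counts[b]), counts[b])
        for b in range(n_bins)
        if counts[b] > 0
    ]

# Hypothetical transect: coordinates in metres, wealth index rising eastward.
points = [(0.0, 0.0), (1000.0, 0.0), (2000.0, 0.0), (3000.0, 0.0)]
values = [10.0, 20.0, 30.0, 40.0]

result = empirical_semivariogram(points, values, bin_width=1000.0, n_bins=4)
for center, gamma, n_pairs in result:
    print(f"lag ~{center:.0f} m: semivariance {gamma:.1f} from {n_pairs} pairs")
```

In this toy case the semivariance keeps rising with distance because the values follow a trend. A variogram that levels off instead gives the sill, and the distance at which it levels off gives the range.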
2.6 Geographically Weighted Regression (Lyons-Galante)
The next analysis returns to regression, but looks at spatial variability with Geographically Weighted Regression (GWR). Developed in 1996 by Brunsdon et al., GWR allows for spatially varying relationships between predictors and the target variable. A local regression model is fitted at each location using only that location and its neighbors. By mapping the coefficients of a given variable across Sierra Leone, we are able to see how the strength of the relationship varies between regions.
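The idea of a local regression can be sketched as a weighted least-squares fit in which nearby observations count more than distant ones. The example below assumes a Gaussian distance-decay kernel, a fixed bandwidth, and a single predictor; all three are illustrative choices, as are the data, and the kernel and bandwidth used in the analysis are not specified here.

```python
import math

def gwr_local_fit(target, points, x, y, bandwidth):
    """Weighted least squares at `target` with Gaussian distance-decay weights.

    Returns (intercept, slope) of the local regression of y on one predictor x.
    """
    w = [math.exp(-0.5 * (math.dist(target, p) / bandwidth) ** 2) for p in points]
    w_sum = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / w_sum
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / w_sum
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

# Hypothetical data: the slope of wealth on electricity is 1 in the west
# (x near 0) and 3 in the east (x near 100).
points = [(0, 0), (1, 0), (2, 0), (100, 0), (101, 0), (102, 0)]
electricity = [0.0, 0.5, 1.0, 0.0, 0.5, 1.0]
wealth = [0.0, 0.5, 1.0, 0.0, 1.5, 3.0]

west = gwr_local_fit((1, 0), points, electricity, wealth, bandwidth=5.0)
east = gwr_local_fit((101, 0), points, electricity, wealth, bandwidth=5.0)
print(f"west slope: {west[1]:.2f}, east slope: {east[1]:.2f}")
```

Repeating the fit at every cluster and mapping the resulting slopes gives the kind of coefficient surface that GWR produces. A single global regression on the same data would return one compromise slope and hide the regional difference.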
2.7 Decision Tree (Lyons-Galante)
The final analysis conducted is a decision tree to predict the wealth index. Though not a spatially explicit model, the decision tree serves a dual purpose – first, given that it is non-parametric, it is able to consider categorical variables without encoding. Second, the decision tree gives a rank of variable importance, which allows us to determine which variables have relatively higher impacts on household wealth. The outcome of this analysis will provide insight into which features are the most significant predictors of wealth, and these relationships will provide valuable information about how to recognize poverty and wealth across Sierra Leone without measuring it directly.
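How a regression tree chooses a split, and why that yields a measure of variable importance, can be sketched with a single split. The code below scores each categorical variable by the largest reduction in squared error achieved by any one-vs-rest split. This is a simplification: tree libraries accumulate such gains over every split in the tree, and the variable names and values here are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_binary_split(feature, target):
    """Best one-vs-rest split of a categorical feature, by reduction in SSE."""
    best = (None, 0.0)
    for category in sorted(set(feature)):
        left = [t for f, t in zip(feature, target) if f == category]
        right = [t for f, t in zip(feature, target) if f != category]
        gain = sse(target) - sse(left) - sse(right)
        if gain > best[1]:
            best = (category, gain)
    return best

# Hypothetical households; categories and values are illustrative only.
wealth = [-90.0, -80.0, -70.0, 60.0, 80.0, 100.0]
features = {
    "cooking_fuel": ["wood", "wood", "wood", "charcoal", "gas", "gas"],
    "has_watch": ["no", "yes", "no", "yes", "no", "yes"],
}

gains = {name: best_binary_split(col, wealth) for name, col in features.items()}
ranking = sorted(gains, key=lambda name: gains[name][1], reverse=True)
for name in ranking:
    category, gain = gains[name]
    print(f"{name}: split on '{category}', SSE reduction {gain:.0f}")
```

The categorical values are compared directly, with no numeric encoding, which is the first advantage noted above. The ordering of the gains is a simple stand-in for the variable importance ranking.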
3. Results and Discussion
The results of the analysis are described below. Each subsection focuses on a given model, followed by a discussion of the results.
3.1 Mapping
Below are two maps that show electricity access throughout Sierra Leone as well as the wealth index.
Figure 2: Point maps of electricity access (left) and household wealth distribution (right) in Sierra Leone. Yellow represents areas with high access to electricity, and green represents areas with a high wealth index.
Figure 3: Area maps of electricity access (left) and household wealth distribution (right) in Sierra Leone. Yellow represents areas with high access to electricity, and green represents areas with a high wealth index.
These maps tell us a few things about the nature of the data. First, because of the sampling strategy of the DHS, we can use the size of the polygons to approximate the population density of each area. The polygons are much smaller on the western tip of the maps, which reflects the high population density surrounding Freetown, the capital city. Second, areas of high wealth coincide spatially with areas of electricity access. The maps suggest a correlation between these two variables, and indeed the correlation coefficient is 0.884. Finally, electricity access is concentrated in urban areas. This makes sense, since electricity grid infrastructure is much more cost-efficient to construct in densely populated areas.
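For readers who want to reproduce a correlation of this kind, the sketch below computes a Pearson coefficient between two cluster-level series. It assumes the reported figure is a Pearson correlation, which the text does not state, and the input values are hypothetical rather than taken from the survey.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    var_x = sum((a - x_bar) ** 2 for a in x)
    var_y = sum((b - y_bar) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical cluster-level values: electricity access share and wealth index.
electricity = [0.0, 0.1, 0.2, 0.6, 0.9, 1.0]
wealth = [-80.0, -60.0, -50.0, 20.0, 90.0, 110.0]
print(f"r = {pearson_r(electricity, wealth):.3f}")
```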
3.2 Regression
First, let us look at the results from the null regression model. This model assumes that the wealth index does not vary across Sierra Leone. The mean value of the wealth index is 623. However, the Root Mean Square Error (RMSE) of this model is an enormous 88,889. Visually, the null model can be represented as a solid color plotted across Sierra Leone. The residuals by region can also be plotted to see where the model under- or over-predicts.
Figure 4: The null model for wealth index across Sierra Leone and its residuals. It assumes the mean value of 623 everywhere. The residuals are very high, and the RMSE is 88,889.
The next model is the non-spatial multiple linear regression model. This model is substantially more accurate, bringing the RMSE down to 19,785. However, several of the variables are found to be insignificant. Removing them improves the adjusted R-squared and AIC, although the RMSE is slightly worse at 20,049. The spatial models are calculated using just the significant variables. The spatial lag model has an RMSE of 18,807 with rho equal to 0.18, indicating positive spatial correlation. The spatial error model has the project's best RMSE of 18,228, with a lambda of 0.5. This tells us that the errors are also positively correlated. These results mirror the results of the Anselin test for significance of spatial regression, in which the error model has a higher parameter value and a lower p-value. Therefore, the spatial error model is selected as the top-performing model.
Figure 5: A map of the residuals of the basic linear model (left) and the spatial error model (right). The Spatial Error Model was the top performing model.
The results of this spatial regression tell us that the wealth index is not randomly distributed and that it has a spatial component. The fact that the spatial error model outperformed the spatial lag model suggests that household wealth is likely influenced by external variables that we have not been able to account for in the survey data.
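For reference, the two spatial models compared in this section are commonly written as follows. Here $y$ is the vector of wealth index values, $X$ the matrix of predictors, $\beta$ the regression coefficients, $\varepsilon$ an independent error term, and $W$ a spatial weights matrix. The form of $W$ (for example, row-standardized contiguity of the Thiessen polygons) is an assumption, as it is not stated in this report.

```latex
% Spatial lag model: the outcome depends on neighboring outcomes.
y = \rho W y + X \beta + \varepsilon

% Spatial error model: the unexplained part is spatially correlated.
y = X \beta + u, \qquad u = \lambda W u + \varepsilon
```

In the lag model, $\rho$ (estimated at 0.18) measures how strongly a location's wealth depends on the wealth of its neighbors. In the error model, $\lambda$ (estimated at 0.5) measures spatial correlation in the part of wealth that the predictors leave unexplained, which is why a better-fitting error model points toward omitted, spatially structured variables.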
3.3 Kriging
The variogram of the data and the resulting map created from Kriging are featured below:
Figure 6a,b: The Semivariogram of the wealth index as a function of distance in meters (above). The Spatial Interpolation of Wealth Index through the use of Kriging (below).
From the variogram, we can see that the nugget is about 4 billion, the sill is about 8 billion, and the range is approximately 6 kilometers. This is notable, as it tells us that beyond about 6 kilometers from a village, household wealth is relatively independent of wealth in that village. It would be interesting to see whether this correlates with the average distance between villages in rural Sierra Leone, and whether it differs in urban areas. When tested on actual data, the RMSE of the kriging model was 63,120. This is high relative to the regression models, but it is considerably better than the null model (88,889). Finally, the best model to fit the variogram and to use for kriging is an exponential one, which slightly outperforms the spherical model. All of this analysis underscores the spatial nature of wealth and poverty.
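To make the variogram parameters concrete, the sketch below evaluates an exponential variogram model using the approximate nugget, sill, and range quoted above. It assumes the common parameterization in which the semivariance approaches the sill as exp(-h / range). Under that convention the curve reaches about 95% of the way from nugget to sill at three times the range parameter. Whether the 6 kilometers reported here is the range parameter or that practical range is not specified, so the numbers are illustrative only.

```python
import math

def exponential_variogram(h, nugget, sill, range_param):
    """Exponential semivariance at lag h: 0 at h = 0, approaching the sill as h grows."""
    if h == 0:
        return 0.0
    return nugget + (sill - nugget) * (1.0 - math.exp(-h / range_param))

# Approximate values read from the variogram described in the text.
NUGGET, SILL, RANGE_M = 4e9, 8e9, 6000.0

for h in (500.0, 6000.0, 18000.0, 60000.0):
    gamma = exponential_variogram(h, NUGGET, SILL, RANGE_M)
    print(f"{h / 1000:5.1f} km  semivariance = {gamma:.3e}")

print(f"nugget share of sill: {NUGGET / SILL:.2f}")
print(f"sqrt(sill): {math.sqrt(SILL):.0f}")
```

Two points follow from the quoted values. The nugget is half of the sill, so roughly half of the variance in wealth is present even between clusters that are very close together. The square root of the sill is about 89,400, close to the null-model RMSE of 88,889 reported in Section 3.2, as would be expected if the sill approximates the overall variance of the wealth index.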
3.4 Geographically Weighted Regression
For GWR, each variable found to be significant in the regression was used and its coefficients were plotted. While many of them showed little variation across the country, the two that showed some interesting variation were access to electricity and ownership of goats.
Figure 7: The coefficients of geographically weighted regression for electricity access (left) and for ownership of goats (right). The effect of each variable differs between more urban and more rural areas.
For electricity access, we see that it is positively correlated with the wealth index. However, this effect is strongest in the eastern part of the country, where electricity access rates are lowest, and weakest in the west, where rates are highest. This suggests that energy access is a larger differentiator of wealth in rural areas than in urban areas, and underscores the importance of developing energy infrastructure in rural areas. Goat ownership, perhaps counterintuitively, is negatively correlated with the wealth index. This effect is stronger in the peri-urban areas in the western part of the country. While a goat is in itself an asset, it seems that households without goats generally have more of other assets. This suggests that pastoralism is not a particularly lucrative source of income in Sierra Leone.
3.5 Decision Trees
The resulting decision tree from the analysis is featured below.
Figure 8: The decision tree resulting from the model that was trained on more variables, both categorical and numerical.
At the top of the decision tree, we see that the source of fuel used for cooking is the most important variable differentiating wealthy and poor households. Next most important are building materials and ownership of a television. The overall performance of the decision tree was an RMSE of 32,329. Interestingly, this is worse than the regression models, possibly because we constrained the depth of the decision tree. On the other hand, we can extract a full ranking of variable importance from the decision tree.
Figure 9: A bar chart of the importance of all of the variables used for the decision tree.
The top 11 variables from here are:
- Type of cooking fuel
- Type of place of residence
- Has electricity
- Has TV
- Has refrigerator
- Owns land useable for agriculture
- Main floor material
- Main wall material
- Has bank account
- Has watch
- Type of toilet facility
This tells us that three main types of variables tell us most about a household's economic wealth: building materials for the house such as roof, walls, and floors; ownership of expensive assets such as a television or refrigerator; and access to resources such as cooking fuel, electricity, and toilet facilities. This result will be helpful to ground surveyors looking to assess the economic status of a rural community.
3.6 Discussion
The results above were generally in line with the authors' expectations. They can be summarized as follows:
- Wealth and electricity access are strongly correlated and occur in areas of higher population density.
- Wealth index has a strong spatial component. In addition to predicting it with a variety of independent variables, adding a spatial error component to the model can improve the performance.
- Wealth index can be interpolated well with an exponential kriging model in areas that were not directly surveyed.
- The relationship between wealth index and factors such as electricity access and goat ownership varies between urban and rural areas, while the relationships for most other factors are relatively consistent across the country.
- The most important factors for assessing the economic status of a household are (1) building materials, (2) access to resources, and (3) ownership of home appliances.
Further research should aim to leverage satellite imagery to see whether features extracted from images correlate with wealth index. Additionally, more granular spatial data would allow for more nuanced investigations into the correlation between wealth index and electricity access. Surveys from previous years should also be analyzed to examine how wealth and electricity access are evolving over time. Finally, this analysis should be extended to other countries with DHS data to see whether the relationships found here still apply.
4. References
DHS. (n.d. a). The DHS Program - Quality information to plan, monitor and improve population, health, and nutrition programs. Retrieved May 5, 2023, from https://dhsprogram.com/
DHS. (n.d. b). Spatial Data Repository - Indicator Data. Retrieved May 5, 2023, from https://spatialdata.dhsprogram.com/data/#/
Jean, N., Burke, M., Xie, M., Davis, W. M., Lobell, D. B., & Ermon, S. (2016). Combining satellite imagery and machine learning to predict poverty. Science, 353(6301), 790-794. https://doi.org/10.1126/science.aaf7894
Longley, P. (2005). Geographic Information Systems and Science. United Kingdom: Wiley.
United Nations. (n.d.). Goal 7 | Department of Economic and Social Affairs. Retrieved December 9, 2022, from https://sdgs.un.org/goals/goal7
United Nations. (n.d.). The 17 Goals | Sustainable Development. Retrieved December 9, 2022, from https://sdgs.un.org/goals
UNOPS. (2018). Access to energy: Giving Sierra Leone the power to change… | UNOPS. https://www.unops.org/news-and-stories/stories/access-to-energy-giving-sierra-leone-the-power-to-change
USAID. (n.d.). Guide to DHS Statistics. Retrieved December 12, 2022, from https://dhsprogram.com/Data/Guide-to-DHS-Statistics/index.cfm
World Bank. (2021). GDP per capita (current US$). https://data.worldbank.org/indicator/NY.GDP.PCAP.CD
World Bank. (2022). Access to electricity (% of population) - Sub-Saharan Africa | Data. Retrieved December 9, 2022, from https://data.worldbank.org/indicator/EG.ELC.ACCS.ZS?end=2020&locations=ZG&start=1996&view=map