Revision as of 11:25, 4 October 2006
5.2. Filling gaps in agricultural statistics.
René Gommes, Linda See, Peter Hoefsloot
Examples of “missing” statistical data
National agricultural statistics are easily available, as they are published by the countries as statistical yearbooks and systematically assembled by FAO into the annual Production Yearbooks (FAOSTAT). Sub-national data, on the other hand, are usually difficult to retrieve and subject to a number of complications when they have to be consolidated into a homogeneous and consistent set at the continental level.
Data are often scarce or unavailable altogether. Possible causes include:
- no sampling is carried out at the national level. Sometimes subjective estimates are produced, but they are of uncertain quality;
- data are collected at the national level, but never documented or actually published in national statistical yearbooks. However, some of those data are available nationally from the services concerned;
- different data are collected for different geographic units (for instance, not all administrative units (AUs) collect data for all crops, or different AUs apply for agriculture and, say, population);
- data are aggregated (by areas or by crops) in a way that is not compatible with the reporting of other countries. A typical example is “millet”, which can be bulrush millet or finger millet, or both together, although the crops are rather different from an autoecological point of view. The worst example is “millet and sorghum” reported as either “millet” or “sorghum”;
- data are not available for all the years from 1986 to 1990 in the reference period, or they are available only for years outside the reference period;
- the AUs were modified during the reference period. This creates only a minor difficulty when AUs are aggregated, but when they are split, it is not always possible to redistribute the statistics between the new AUs;
- the amounts cultivated and harvested are deemed to be negligible if they are below a cut-off that varies according to countries;
- crops are not reported on separately, for instance white potatoes can sometimes be lumped with vegetables and appear nowhere in the statistics.
In order to fill the gaps, parameters will have to be interpolated or otherwise estimated.
Examples of a strategy for filling gaps in maps
A number of methods can be applied to fill gaps. Which methods are applied depends on the agricultural parameter at hand, and usually a sequence of methods is necessary. Some examples are given below.
Production
- Disaggregation of national production
- Spatialization of per capita production x AU population
- AU yield x AU cropped area
Yield
- Regression against environmental variable
- Inverse-distance spatialization
- SEDI interpolation
Cropped Area
- Disaggregation of national cropped area
- Spatialization of area per capita x AU population
- Spatialization of relative area x AU total area
Considering, for instance, that AU yield can be derived using three methods and that cropped area can also be derived with different approaches, the number of “final” values is rather high. Aggregates can also be estimated directly as totals (“coarse grains”) or as the sum of their individual components (i.e. maize, barley, millet, sorghum, etc.), which adds another set of possible values. Whenever the estimates are reasonably close, they can be averaged without much afterthought, but when they differ by an order of magnitude, subjective choices have to be made.
Therefore, the results are user-dependent, and they result from a set of common-sense rules rather than from a method as such. It is assumed that the methods used, as well as the harmonization with FAOSTAT, ensure that the final product is consistent.
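The rule of thumb for combining competing estimates can be sketched as follows. This is only an illustration: the function name `reconcile`, the tolerance `close_ratio` and the sample figures are assumptions, not values taken from the text, which leaves "reasonably close" undefined.

```python
from statistics import mean

def reconcile(estimates, close_ratio=2.0):
    """Average candidate estimates when they agree, otherwise flag them.

    close_ratio is a hypothetical tolerance: estimates are treated as
    "reasonably close" when the largest is at most close_ratio times
    the smallest. Larger spreads are left for a subjective choice.
    """
    values = [v for v in estimates.values() if v is not None and v > 0]
    if not values:
        return None, "no estimate"
    if max(values) / min(values) <= close_ratio:
        return mean(values), "averaged"
    return None, "manual choice needed"

# Hypothetical candidate values for the production of one AU (tonnes)
candidates = {
    "disaggregated_national": 1200.0,
    "per_capita_x_population": 1500.0,
    "yield_x_area": 0.5 * 2520.0,  # AU yield (t/ha) x AU cropped area (ha)
}
value, status = reconcile(candidates)

# Two estimates that differ by more than an order of magnitude
conflict = reconcile({"method_a": 100.0, "method_b": 5000.0})
```

Returning a status string next to the value keeps track of which figures were averaged automatically and which still need a decision by the analyst.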
Description of Methods of estimation
Disaggregation of quantities
Disaggregation consists in re-distributing the total national production or areas in proportion to some other known variable, for instance population. This can be applied only if the country is relatively homogeneous from an agroecological point of view.
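A minimal sketch of proportional disaggregation, assuming population as the known variable. The function name, the AU names and the figures are hypothetical.

```python
def disaggregate(national_total, au_weights):
    """Redistribute a national total in proportion to a known AU variable.

    au_weights maps each AU to the value of the known variable
    (here: population). The shares always add up to national_total.
    """
    total_weight = sum(au_weights.values())
    if total_weight <= 0:
        raise ValueError("the known variable must have a positive total")
    return {au: national_total * w / total_weight
            for au, w in au_weights.items()}

# Hypothetical country with three AUs and 10 000 t of national production
population = {"AU_A": 200_000, "AU_B": 300_000, "AU_C": 500_000}
production = disaggregate(10_000.0, population)
```

Because every AU receives the same amount per inhabitant, the result is only meaningful under the homogeneity condition stated above.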
Inverse distance weighting
Geostatistical interpolation of missing data consists in the estimation of missing values at one point in space based on the known values of neighboring entities. Inverse-distance weighting is one of the most straightforward methods; it takes into account the distance between the “known” and “unknown” points and their relative importance in the estimation. For instance, close-by AUs of the same country are assigned a higher weight than AUs of neighboring countries at the same distance. The reason is that production and feeding behavior are considered to be more homogeneous within a country.
For the geostatistical interpolation, it can generally be considered that the reported crop yield corresponds to the centre of gravity of the AU. Inverse-distance weighting was applied mainly to area cultivated per capita, per capita production and relative area. The AgrometShell function “Tools-Interpolate to replace missing data” can be used to perform this type of operation.
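The weighting scheme can be sketched as below. This is not the AgrometShell implementation: the exponent `power`, the down-weighting factor `foreign_factor` for AUs of other countries, the planar coordinates and the sample values are all assumptions made for illustration.

```python
import math

def idw_estimate(target, neighbours, power=2.0, foreign_factor=0.5):
    """Inverse-distance weighted estimate for one AU centre of gravity.

    target:     dict with keys "x", "y", "country"
    neighbours: list of dicts with keys "x", "y", "country", "value"
    AUs from another country get their weight multiplied by
    foreign_factor (a hypothetical value below 1).
    """
    num = 0.0
    den = 0.0
    for n in neighbours:
        d = math.hypot(n["x"] - target["x"], n["y"] - target["y"])
        if d == 0:
            return n["value"]  # a known value exists at the target itself
        w = 1.0 / d ** power
        if n["country"] != target["country"]:
            w *= foreign_factor
        num += w * n["value"]
        den += w
    if den == 0:
        raise ValueError("no neighbours with known values")
    return num / den

# Two neighbours at the same distance, one domestic and one foreign
target = {"x": 0.0, "y": 0.0, "country": "BF"}
neighbours = [
    {"x": 1.0, "y": 0.0, "country": "BF", "value": 100.0},
    {"x": -1.0, "y": 0.0, "country": "ML", "value": 200.0},
]
estimate = idw_estimate(target, neighbours)
unweighted = idw_estimate(target, neighbours, foreign_factor=1.0)
```

With the country factor the estimate moves towards the domestic neighbour; with `foreign_factor=1.0` the function reduces to plain inverse-distance weighting.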
External variables useable for spatial interpolation
The idea behind this paragraph is to use pixel-based information that is generally available from non-national sources, like satellite imagery, or topographic information from digital terrain models, etc., to assist with the estimation of the missing statistical information by administrative unit (AU).
As indicated, the spatial interpolation can either be purely geostatistical or take advantage of the additional knowledge obtained from external variables. The method used in the first category is known as “inverse distance weighting”; the method used in the second was Satellite Enhanced Data Interpolation (SEDI).
In semi-arid areas, good correlations can be found between environmental conditions and yield (kg/ha). Once average AU values of NDVI, elevation, etc. are available for surrounding areas, yields can be regressed against the external environmental variables. The method is not applicable if cultivars vary within the same agroclimatic area. Generally this information is not included in the statistics, and the database was thus considered cultivar-independent. It was also not feasible to distinguish between irrigated and rainfed crops, or between subsistence farming and large-scale modern agricultural production.
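As an illustration of the regression step, the sketch below fits a straight line of yield against AU-average NDVI by ordinary least squares and uses it to estimate the yield of an AU with no statistics. The linear form, the function name and the data are assumptions; the text does not specify the regression model.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical surrounding AUs: average NDVI and reported yield (kg/ha)
ndvi_known = [0.10, 0.15, 0.20, 0.25]
yield_known = [300.0, 500.0, 700.0, 900.0]

slope, intercept = fit_line(ndvi_known, yield_known)

# Estimate for an AU with missing yield but a known average NDVI of 0.18
yield_missing = slope * 0.18 + intercept
```

Several environmental variables (NDVI, elevation, etc.) would call for a multiple regression, which follows the same logic.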
The main external variable of interest was NDVI (Normalized Difference Vegetation Index), created by NASA/GIMMS and regularly available every 10 days since 1981. It is one of the most popular remotely sensed indicators for monitoring the response of vegetation to weather conditions in several parts of Africa.
NDVI is an indicator of the density of living green mass; in theory, it varies from -1 to +1, but in practice only values between 0 and 0.7 are found on land areas.
1981-1991 average monthly NDVI data from ARTEMIS (FAO, 1993) were used to derive the NDVI variables included in the AGDAT database. The most relevant, in the current context, are NDVI monthly average, maximum and minimum together with the same values relative to the value of 0.12. This threshold corresponds to the occurrence of green vegetation on the ground. The interpretation of NDVI in humid tropical regions is difficult due to the absorption of infra-red light by water vapor and because the response of the index as a function of biomass reaches a plateau (saturation) at high biomass values.
Figure 1 below illustrates a typical relation between yield and NDVI, showing, among other things, NDVI values starting at about 0.07 and yields levelling off at NDVI values above 0.2. Note that the values indicated correspond to the spatial average over a whole AU, where no crops are actually grown below about 0.12. This explains why relatively high yields can be found at low NDVI when some crops are irrigated or grown in the wettest parts of the AU only.
Figure 1. Coarse grain yield and NDVI in Burkina Faso and the surrounding countries. The figure is restricted to yields below 1200 kg/ha.
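The text does not give the formula, but NDVI is conventionally computed from near-infrared and red reflectances as (NIR - RED) / (NIR + RED), which is why it is bounded by -1 and +1. A short sketch, with hypothetical reflectance values:

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from two reflectances."""
    total = nir + red
    if total == 0:
        raise ValueError("NDVI is undefined when both reflectances are zero")
    return (nir - red) / total

# Hypothetical reflectances (fractions between 0 and 1)
vegetated = ndvi(nir=0.5, red=0.1)   # strong near-infrared response
lower_bound = ndvi(nir=0.0, red=0.3)
upper_bound = ndvi(nir=0.3, red=0.0)
```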
Satellite Enhanced Data Interpolation, or SEDI
SEDI takes advantage of the correlation between an environmental variable, for instance the above-mentioned NDVI/biomass, and agricultural yields. One way to approach this is co-kriging, a variant of kriging that uses one or more auxiliary variables and exploits both the spatial features of the variable to be interpolated and the correlations between that variable and the auxiliary variables (Bogaert et al., 1995).
The SEDI interpolation method originated in a Harare-based FAO Regional Remote Sensing Project. It was originally developed to interpolate rainfall data collected at station level using the additional information provided by METEOSAT cold cloud duration images. The method proved powerful and versatile, and it is now regularly used by FAO to spatially interpolate other parameters as well (e.g. potential evapotranspiration, crop yields, actual crop evapotranspiration estimates, etc.).
The concepts of this interpolation method and software implementing the technique have been described by Hoefsloot (1996). The SEDI functions were recently incorporated into the WINDISP_3 software (Pfirman and Hogue, 1998).
SEDI is a simple and straightforward method for 'assisted' interpolation. The method can be applied to any parameter of which the values are available for a number of geographical locations, as long as a 'background' field is available that has a negative or positive relation to the parameter that needs to be interpolated.
Three conditions must be met for the successful application of the SEDI method:
- The availability of the parameter to interpolate as point data at different geographical locations (e.g. rainfall, potential evapotranspiration, crop yields). In the present case of statistical variables, they were assigned a co-ordinate corresponding to the centre of gravity of the AU;
- The availability of a background parameter in the form of a regularly spaced grid (or field) for the same geographical area (e.g. the above-mentioned NDVI variables, altitude);
- A relation between the two parameters (negative or positive; Yield/NDVI is positive, temperature/altitude would be negative). A Spearman rank correlation test can reveal whether a relation exists, and how strong this relation is.
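The third requirement can be checked with a few lines of code. The sketch below computes the Spearman rank correlation coefficient only (not its significance level), giving tied values their average rank; the function names and data are hypothetical.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: yield rises with NDVI, temperature falls with altitude
rho_yield = spearman([0.08, 0.12, 0.15, 0.20, 0.30],
                     [200.0, 450.0, 600.0, 900.0, 950.0])
rho_temp = spearman([100.0, 500.0, 900.0, 1500.0],
                    [28.0, 25.0, 22.0, 18.0])
```

Because the coefficient works on ranks, it detects any monotonic relation, including the levelling-off of yield at high NDVI.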
The SEDI method yields the parameter mentioned under the first requirement as a field (i.e. an image covering the whole area under consideration). The average of the field value over the AU provides the estimate of the spatialized statistic. The method is illustrated below using rainfall and “Cold Cloud Duration”.
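This section does not spell out the algorithm, so the sketch below is only one plausible reading of "assisted" interpolation and should not be taken as the reference SEDI implementation. It assumes a ratio-based scheme: the ratio of the point value to the background value is computed at each point, the ratios are interpolated by inverse distance, and the ratio field is multiplied by the background grid. The grid layout, function names and figures are hypothetical.

```python
import math

def idw(col, row, points, power=2.0):
    """Inverse-distance interpolation of (col, row, value) points."""
    num = 0.0
    den = 0.0
    for pc, pr, value in points:
        d = math.hypot(pc - col, pr - row)
        if d == 0:
            return value
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

def assisted_field(points, background, power=2.0):
    """Interpolate point values with the help of a background grid.

    points:     list of (col, row, value), e.g. AU centres of gravity
    background: 2D list indexed [row][col], non-zero at every point
    """
    ratios = [(c, r, v / background[r][c]) for c, r, v in points]
    return [[idw(c, r, ratios, power) * background[r][c]
             for c in range(len(background[0]))]
            for r in range(len(background))]

def au_mean(field, cells):
    """Average the field over the (col, row) cells belonging to one AU."""
    return sum(field[r][c] for c, r in cells) / len(cells)

# Hypothetical 2 x 3 NDVI grid and two known yields (kg/ha)
background = [[0.10, 0.20, 0.30],
              [0.10, 0.20, 0.30]]
points = [(0, 0, 400.0), (2, 1, 1200.0)]

field = assisted_field(points, background)
au_cells = [(c, r) for r in range(2) for c in range(3)]
estimate = au_mean(field, au_cells)
```

In this example both points have the same yield-to-NDVI ratio, so the resulting field simply follows the background pattern and the AU average equals the average of the scaled grid.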
A detailed description of the SEDI method can be found here