User Guide Part 2: What's in an Extract?

Extract file overview

Screenshot listing files in the example extract, including four metadata files and four data files, one for Belize and three for Latvia.

You will receive your extract as a .zip file, which you can open with your favorite unzipping utility. Inside the .zip file, you will find four metadata files and one or more data files.

The metadata file names begin with "ihgis" and your extract number.

Data file names consist of:

  • Two-letter country code
  • Year
  • Dataset type code (e.g., "ag" for agricultural census, "pop" for population census)
  • Three-character table code
  • Hierarchical level code (g0 is national, g1 is the largest subnational units, etc.)
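
The naming scheme above can be unpacked programmatically. The following is a minimal sketch, not an official IHGIS utility. It assumes the five components appear in the listed order, that the dataset type code is lowercase letters, that the table code is three uppercase letters or digits, that components may be separated by underscores, and that data files end in .csv. The function name parse_data_file_name and the sample file name are hypothetical; compare them against the actual names in your extract before relying on the pattern.

```python
import re

# Assumed layout (not confirmed by the guide): components in the listed order,
# optionally separated by underscores, followed by a .csv extension.
DATA_FILE_PATTERN = re.compile(
    r"^(?P<country>[A-Za-z]{2})_?"
    r"(?P<year>\d{4})_?"
    r"(?P<dataset_type>[a-z]+)_?"
    r"(?P<table>[A-Z0-9]{3})_?"
    r"(?P<level>g\d+)"
    r"\.csv$"
)


def parse_data_file_name(file_name):
    """Split a data file name into its components, or return None if it does not match."""
    match = DATA_FILE_PATTERN.match(file_name)
    return match.groupdict() if match else None


# Hypothetical file name, for illustration only.
parts = parse_data_file_name("BZ2010pop_AAA_g0.csv")
```

Names that do not fit the assumed pattern, such as the metadata files beginning with "ihgis", return None, which gives a simple way to separate data files from metadata files when listing an extract.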

 

Codebook

The ihgisXXXX_codebook.txt file is a human-readable summary of the contents of your extract. It includes basic information about the datasets, tables, and variables, as well as the recommended citation for IHGIS.

The other metadata files are provided as comma-separated values files to facilitate importing them into statistical packages or other software tools.

Tabulation geography metadata

Screenshot of a tabulation geography metadata CSV file showing one tabulation geography for Belize and two for Latvia.

The ihgisXXXX_geog.csv file provides the name of each tabulation geography that is included in one or more tables in your extract.

Table metadata

The ihgisXXXX_tables.csv file provides detailed metadata for each table in your extract. The fields in this file consist of:

  • dataset, table, and dataset_table: Codes for the dataset, the data table, and a concatenation in which the two codes are separated by a period.
  • title: Title of the table.
  • table_num: Designation of the table in the source document.
  • table_universe: Entities considered in the table. For percents and ratios, the universe refers to the denominator.
  • tabulation_geogs: Tabulation geographies for which the table is available.
  • tabulation_geog_labels: Names of the tabulation geographies.
  • source_pub_eng: Title of the document or document series in which the table was originally published. It may be a translation into English of the original native-language title.
  • country: The country the table describes.
  • footnotes: Any footnotes present for the table. (May not be present.)

 

Screenshot of a table metadata CSV file showing one table for Belize and two tables for Latvia.

Data dictionary (variable metadata)

The ihgisXXXX_datadict.csv file provides detailed metadata for the variables (columns) in the tables in your extract. This information is key to interpreting the data in the data files. The fields in this file consist of:

  • dataset, table: Codes matching the file name of the data file containing the listed variables.
  • table_var: Codes providing the link to the column headers in the data files.
  • label: Description of the variable represented in the corresponding column in the data files, i.e., the column header.
  • data_year: Year represented by the data in a given column, which may be different from the year of the dataset. For example, a table may describe population growth over time, with population counts from several years prior to the census.
  • universe: Describes the scope of who or what is covered by the variable. For example, data on marital status or economic activity may only cover persons over a certain age. For percents and ratios, the universe refers to the denominator.
  • agg_method: Arithmetic operation used to aggregate information from individual census responses to calculate the summary values in the table. The most common aggregation methods are count and percent.
  • agg_detail: Additional aggregation details necessary to fully describe how the variable was calculated. For example, aggregation details may include units of measurement, numerators and denominators of ratios, or scaling factors.
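
Because table_var links the data dictionary to the column headers in the data files, the dictionary can serve as a lookup table. The sketch below uses only Python's standard csv module and groups labels by dataset and table, on the assumption that a table_var code might not be unique across datasets. The function name load_labels and every value in the sample rows (including the dataset code format) are invented for illustration.

```python
import csv
import io

# Hypothetical data dictionary rows; only the field names come from the guide.
SAMPLE_DATADICT = """dataset,table,table_var,label,data_year,universe,agg_method,agg_detail
BZ2010pop,AAA,AAA001,Total population,2010,All persons,count,
BZ2010pop,AAA,AAA002,Male population,2010,All persons,count,
"""


def load_labels(datadict_file):
    """Return {(dataset, table): {table_var: label}} from an open data dictionary file."""
    labels = {}
    for row in csv.DictReader(datadict_file):
        key = (row["dataset"], row["table"])
        labels.setdefault(key, {})[row["table_var"]] = row["label"]
    return labels


labels = load_labels(io.StringIO(SAMPLE_DATADICT))
```

With a real extract, an open file handle for the ihgisXXXX_datadict.csv file would be passed in place of the in-memory sample.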

 

Screenshot of a data dictionary CSV file showing variable-level metadata for a table from Belize 2010 and one from Latvia 2011.

Data files

Each data file contains data from a table in the source document for one tabulation geography. (In cases where the source document included separate tables for subnational geographic units, those tables have been combined into nation-wide data files.)

GISJOIN codes provide the link between rows of data and polygons in the GIS boundary files. You may join data files to shapefiles in a GIS package using the GISJOIN field in both files.
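
In practice a GIS package performs this join for you, but the matching logic is simple enough to sketch in plain Python. The example below is illustrative only: the function name join_on_gisjoin, the GISJOIN values, the area_sq_km field, and all figures are invented, and the boundary attributes are represented as a CSV export rather than a real shapefile.

```python
import csv
import io

# Hypothetical data file rows (GISJOIN values and figures are invented).
SAMPLE_DATA = """GISJOIN,g0,g1,AAA001
X001,Exampleland,North,1200
X002,Exampleland,South,3400
"""

# Hypothetical attribute rows exported from a boundary file.
SAMPLE_BOUNDARIES = """GISJOIN,area_sq_km
X001,50.0
X002,75.5
X003,10.0
"""


def join_on_gisjoin(data_file, boundary_file):
    """Attach data rows to boundary rows that share a GISJOIN code."""
    data_by_code = {row["GISJOIN"]: row for row in csv.DictReader(data_file)}
    joined = []
    for boundary in csv.DictReader(boundary_file):
        data_row = data_by_code.get(boundary["GISJOIN"], {})
        # Every polygon is kept; polygons without matching data keep only their own fields.
        joined.append({**data_row, **boundary})
    return joined


joined = join_on_gisjoin(io.StringIO(SAMPLE_DATA), io.StringIO(SAMPLE_BOUNDARIES))
```

Keeping every boundary row, even those without matching data, mirrors the usual behavior of a table join in a GIS package, where unmatched polygons remain on the map with empty attribute values.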

The next set of columns (g0, g1, g2…) provides the names of the geographic units and their parent units.

The remaining columns provide the actual data. The codes in the header row (e.g., AAA001) correspond to variable descriptions in the codebook and data dictionary.
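
The column layout described above can be checked programmatically before analysis. This is a rough sketch that assumes the key column is named exactly GISJOIN and that the geographic name columns are named g0, g1, g2, and so on; any other column is treated as data. The function name classify_columns and the sample header are hypothetical.

```python
import re


def classify_columns(header):
    """Sort header names into the GISJOIN key, geographic name columns, and data columns."""
    groups = {"key": [], "geography": [], "data": []}
    for name in header:
        if name == "GISJOIN":
            groups["key"].append(name)
        elif re.fullmatch(r"g\d+", name):
            groups["geography"].append(name)
        else:
            groups["data"].append(name)
    return groups


# Hypothetical header row, for illustration only.
groups = classify_columns(["GISJOIN", "g0", "g1", "AAA001", "AAA002"])
```

The names collected under "data" are the codes to look up in the table_var field of the data dictionary.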

Screenshot of a data CSV file for a table from Belize 2010.