Representing and manipulating data


This section is concerned with how data can be represented in external files and how these data can be visualised. How internal data structures work is covered elsewhere.

Fortran I/O is a pain. It should, therefore, be kept as simple as possible. Either write/read simple ASCII files or use libraries such as netCDF or HDF5 for more complex needs. If for some reason you need to parse binary files in some strange format, I recommend writing the I/O in C and wrapping the resulting C functions in Fortran. You can use cfortran.h or the new Fortran 2003 C interoperability features to bind C and Fortran.

Some of the examples on this page can be found on the BerliOS subversion server. Check them out using

svn checkout svn://svn.berlios.de/sms2009/trunk/data_handling/

Simple Tables

Simple tables contain 1D data, i.e. data with a one dimensional dependency such as a time series from a single location or a measurement along a track. These data can simply be stored in an ASCII file containing multiple columns. Each row should contain the same number of values, separated by whitespace. Empty lines and comments (strings starting with a special character such as #) should be handled by the parser.

# this is an example data file
1 10 100. 2.
2 3  30.  5.
3 5  60.  3.

Simple ASCII files can be easily created using Fortran. For example, the file above can be created by the following Fortran program:

program simple
 implicit none
 integer, parameter :: n = 3 !< the number of data points
 integer, dimension(n) :: col1 = (/1,2,3/)
 integer, dimension(n) :: col2 = (/10,3,5/)
 real, dimension(n)    :: col3 = (/100.,30.,60./)
 real, dimension(n)    :: col4 = (/2.,5.,3./)
 integer, parameter :: unit = 10
 integer i

 ! write a comment line followed by one row per data point
 open(unit,file="test.data",status="unknown")
 write(unit,*) "# this is an example data file"
 do i=1,n
   write(unit,*) col1(i),col2(i),col3(i),col4(i)
 end do
 close(unit)
end program simple

The data can then be read using Python with

columns=['col1','col2','col3','col4']
data = {}
for c in columns:
   data[c] = []
for line in open("test.data","r").readlines():
   # strip comments starting with #
   idx = line.find("#")
   if idx>-1:
      line = line[:idx]
   line = line.strip()
   # skip empty lines
   if len(line) == 0:
      continue
   line = line.split()
   # skip lines with too few columns
   if len(line)<len(columns):
      continue
   for i in range(0,len(columns)):
      data[columns[i]].append(float(line[i]))
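
For a simple whitespace-separated table like this, numpy can also do all of the parsing in a single call; a minimal sketch (loadtxt skips blank lines and, with the default comments='#', strips comments):

import numpy

col1, col2, col3, col4 = numpy.loadtxt("test.data", unpack=True)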

A similar program (including sanity checks) can be written in Fortran (as an exercise). Once the data are read we would also like to see what they look like. Matplotlib is a Python module which does all kinds of 2D plotting. It comes with an interface called pylab which is very similar to Matlab. A simple, interactive plot can be created using

import pylab
 
pylab.plot(data["col1"],data["col2"])
pylab.show()
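
A few more pylab calls label the axes and write the plot to a file instead of (or as well as) displaying it interactively, e.g.

pylab.xlabel("col1")      # label the axes
pylab.ylabel("col2")
pylab.savefig("test.png") # write the figure to a PNG file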

Simple ASCII files are easy to handle: you can look at (and change) them using your favorite editor, and they are platform independent. However, they take up a lot of space (an 8-byte double precision value becomes a roughly 20-byte string) and they are slow to read and write because of the conversion between binary and string representations. The next section introduces binary data files which are language and platform independent.


Gridded Multi-dimensional Data

As we have seen in the previous section, small 1D datasets can easily be stored in plain ASCII files. For large or multi-dimensional datasets ASCII files become prohibitive; binary files are much more efficient for storing large datasets. However, there are many ways of encoding integers and floating point numbers in binary:

  • data size, e.g. 1, 2, 4, 8 byte words
  • ordering of indices of multi-dimensional data, i.e. whether the format is row major (C or python) or column major (Fortran or Matlab)
  • byte ordering: big/little endian
  • floating point representations, e.g. IEEE 754. See also this paper on how floating point numbers are represented and the associated pitfalls[1]

These issues make handling binary data difficult. In order for binary files to work, their format has to be precisely specified. This is a pain. Luckily, there are libraries which do all the complicated work.

There are two somewhat competing offerings: netCDF and HDF. Both offer a platform and language independent way of writing/reading binary files. Both are self-describing, i.e. metadata can be attached to the data that fully specifies every aspect, e.g. measurement units. Both also support multi-dimensional data that can be sliced and diced in any way. Their main difference is that netCDF is much simpler to use but does not support advanced features of HDF such as data compression, parallel I/O, user-defined data types, bitmaps, etc. However, this difference is somewhat reduced with the recent netCDF version 4, which offers an advanced data model (while keeping the API simple) that is built on top of HDF5.


Vanilla netCDF

The netCDF User Guide is an excellent source for background information on how netCDF works. The API Guides (C and Fortran90) give details on how the library is used.

Components of a netCDF File

netCDF files consist of three types of data:

  • dimensions define the number of items along one coordinate axis. A file can have multiple fixed length dimensions. It can have only one unlimited dimension (one that can grow), unless you are using the new netCDF-4 format[2].
  • variables have a type (integer, real, etc.) and are spanned by a list of dimensions.
  • attributes store metadata. They can be either global or attached to variables.

File Formats

The current version of the netCDF library supports multiple file formats:

  • the deprecated version 2 format; you can forget about this format unless you need to deal with legacy files
  • the classic format, which is the format you get by default
  • the 64bit format, which is an extension of the classic format allowing larger data sets
  • the netCDF-4 classic format, which allows the same kind of data objects as the classic or 64bit formats to be stored in the new HDF5 based data files. Although this format does not provide the advanced features of the new data model, it does offer performance enhancements such as chunked and parallel I/O
  • the new netCDF-4 format, which offers multiple unlimited dimensions, groups, derived types, etc.

When you create a new netCDF file you can specify which format you want to use.
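
For example, with the netcdf4-python module (discussed below) the format is chosen when the file is created. A minimal sketch, assuming a recent version of that library:

from netCDF4 import Dataset

# format can be 'NETCDF3_CLASSIC', 'NETCDF3_64BIT',
# 'NETCDF4_CLASSIC' or 'NETCDF4'
nc = Dataset("new.nc", "w", format="NETCDF4_CLASSIC")
nc.close()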

The classic format uses 32bit file offsets and is therefore limited to files smaller than 2GiB, while the 64bit format is limited to variables of less than 4GiB per record. Both classic formats support only 8, 16 and 32bit signed integers and 32 and 64bit floats, and only one unlimited dimension. Most of these limitations are lifted in the new HDF5 based file formats.

CDL

netCDF is a binary format; however, it can sometimes be useful to represent the data in a human readable ASCII form. This form is called the network Common Data form Language, or CDL for short. You can use the ncdump utility to view the CDL representation of a netCDF file:

ncdump -h <file>

The -h flag dumps only the header of the netCDF file, without the data section. A CDL file (including some data) might look something like

netcdf example {    // example netCDF specification in CDL
 
dimensions:
  lat = 10, lon = 5, time = unlimited;
 
variables:
  int lat(lat);
      lat:long_name = "latitude";
      lat:units = "degreesN";
  int lon(lon);
      lon:long_name = "longitude";
      lon:units = "degreesE";
  int time(time);
      time:units = "years";
  float v(time,lat,lon);
       v:long_name = "some quantity v";
 
// global attributes
      :author = "Magnus Hagdorn";
 
data:
   lat   = 0, 10, 20, 30, 40, 50, 60, 70, 80, 90;
   lon   = -140, -118, -96, -84, -52;
}

You can also create a netCDF file from a CDL file using the ncgen utility:

ncgen -o example.nc example.cdl

CDL is very useful for documenting your file format.
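
ncdump can also dump the data of selected variables only, using the -v flag with a comma-separated list of variable names, e.g.

ncdump -v lat,lon example.nc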

The Fortran90 API

The Fortran90 API documentation is very good, so have a look. All netCDF calls are functions which return an integer indicating whether the call was successful. You should really check these return values and act accordingly.

A skeleton F90 program could be

!> simple program loading a netCDF file
!! \author Magnus Hagdorn
!! \date July 2009
program simple
  use netcdf
  implicit none
 
  integer status, ncid
 
  status = nf90_open("example.nc",NF90_NOWRITE,ncid)
  call checkError(status,__FILE__,__LINE__)
 
  status = nf90_close(ncid)
  call checkError(status,__FILE__,__LINE__)
 
contains
  subroutine checkError(status,fname,line)
    implicit none
    integer, intent(in) :: status                   !< the return value of the netCDF call
    character(len=*), optional, intent(in) :: fname !< name of the file where the error occurred
    integer, optional, intent(in) :: line           !< line number where the error occurred
 
    if (status/=NF90_NOERR) then
       write(*,'(a)',advance='no') 'Error '
       if (present(fname) .and. present(line)) then
          write(*,'(a,i0,a)',advance='no') '('//trim(fname)//',',line,') '
       end if
       write(*,'(a)') nf90_strerror(status)
       stop
    end if
  end subroutine checkError
end program simple

Note that this program must be compiled with the preprocessor enabled (simply name the file simpleNetCDF.F90; the capitalised .F90 suffix makes most compilers run the preprocessor automatically). The program is compiled using

gfortran -o simpleNetCDF simpleNetCDF.F90 -lnetcdff -lnetcdf

netCDF and python

Unfortunately, there are many different (and not quite finished) Python interfaces to netCDF[3]. This is not helped by the fact that there are a number of different implementations of numeric arrays (you should, however, use numpy). In my experience the netCDF interface from Scientific Python works best (although the website is frequently unavailable). The netcdf4-python module also looks very promising. Scientific Python is available for Ubuntu, so that is what we are going to use.

In order to use the Python netCDF interface you need to import the module

>>> import Scientific.IO.NetCDF

You can then open an existing file

>>> aFile=Scientific.IO.NetCDF.NetCDFFile('example.nc','r')

netCDF dimensions and variables are mapped to python dictionaries:

>>> print aFile.dimensions.keys()
['lat', 'lon', 'time']
>>> print aFile.variables.keys()
['lat', 'v', 'lon', 'time']

The variable object has an attribute called dimensions which holds a tuple of the dimensions spanning the variable, e.g.

>>> print aFile.variables['v'].dimensions
('time', 'lat', 'lon')

The data is a multidimensional array, e.g.

>>> print aFile.variables['lat'][:]
array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90], 'i')
>>> aFile.variables['lat'].getValue()
array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90], 'i')

netCDF attributes are mapped to Python attributes of the variable instances (and of the file instance for global attributes)

>>> print aFile.author
Magnus Hagdorn
>>> print aFile.variables['v'].long_name
some quantity v

Once you are done with a file you should also close it

>>> aFile.close()

You can also create or modify netCDF files; you just need to open them with mode 'w' or 'a'. You then simply write to the file by accessing the data structures:

>>> aFile=Scientific.IO.NetCDF.NetCDFFile('example.nc','a') # re-open for appending
>>> aFile.author = 'Foo Bar'
>>> aFile.history = "some modifications to the file" # create a new attribute
>>> import numpy
>>> aFile.variables['v'][1,:,:] = numpy.ones(aFile.variables['v'].shape[1:],'f')
>>> aFile.close()
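
Creating a file from scratch follows the same pattern: create the dimensions, create variables spanned by them, then fill in the data. A minimal sketch using Scientific.IO.NetCDF (the file and variable names are just for illustration):

import numpy
import Scientific.IO.NetCDF

nc = Scientific.IO.NetCDF.NetCDFFile('new.nc','w')
nc.createDimension('lat',10)                # fixed length dimension
nc.createDimension('time',None)             # None makes the dimension unlimited
lat = nc.createVariable('lat','i',('lat',)) # 'i' is the integer typecode
lat.units = 'degreesN'
lat[:] = numpy.arange(0,100,10)
nc.close()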

You can see that the Python interactive shell is actually quite useful if you need to quickly manipulate a netCDF file.

Plotting Variables

matplotlib can be used to plot 2D data sets, e.g.

import pylab
import numpy
import Scientific.IO.NetCDF
 
time=-1 # index of the time slice to plot; -1 is the last one
 
# load file
aFile = Scientific.IO.NetCDF.NetCDFFile('fenscan.2000a.nc','r')
 
# create a mask based on ice thicknesses
mask = numpy.array(aFile.variables['thk'][time,:,:])<1.e-5
 
# create array holding ice thickness
thk = numpy.ma.array(aFile.variables['thk'][time,:,:],mask=mask)
 
pylab.imshow(thk,origin='lower')
pylab.show()


CF

The netCDF library implements a standard for storing multi-dimensional data and associated metadata. It does not standardise the kind of metadata that is used to describe the data. Defining such an ontology is quite complex. There are a number of netCDF metadata conventions. In particular the NetCDF Climate and Forecast (CF) Metadata Convention is applicable to ice sheet modelling.

The CF standard does not specify how netCDF variables are named, but it does specify what attributes should be attached. In particular, variables should have an attribute standard_name which precisely describes the quantity stored in the variable. There is a big list of standard names and a process for adding new names. The units of measurement should also be specified using the attribute units. Global attributes are used to describe the convention and the file contents. You should have a look at some example files.
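
Continuing the Scientific.IO.NetCDF sketch from above (and assuming dimensions time, lat and lon exist in the file), attaching CF metadata is just a matter of setting attributes; land_ice_thickness is one of the CF standard names:

thk = nc.createVariable('thk','f',('time','lat','lon'))
thk.standard_name = 'land_ice_thickness' # from the CF standard name table
thk.units = 'm'
nc.Conventions = 'CF-1.3'                # global attribute naming the convention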

Have a look at the seaRISE data sets which are CF 1.3 compliant.

Non-Regular Grids

So far we have only dealt with regular grids, i.e. grids that are spanned by the Cartesian tensor product x⊗y, where x are the points along the x-axis and y are the points along the y-axis.

The simplest departure from a regular grid is one with the same topology, i.e. each node still has 4 connected neighbours, but where the node positions cannot be described by a Cartesian tensor product. In this case the coordinates of each node have to be specified explicitly. We already do this for the latitudes and longitudes of our meshes.

In some cases it might be possible to use a sparse matrix format for storing a strange collection of points, but usually the next option is to use an unstructured mesh. Unstructured meshes typically arise in finite element modelling. Finite element meshes are typically stored as a list of node coordinates and lists of elements referring to node IDs (a small sketch follows the list below). Coming up with your own format is not that difficult; however, since it is a fairly standard problem you might as well use one of the many available libraries for handling finite element meshes. This is especially true because mesh generation and visualisation are quite a pain, so using existing software is most likely to be easier. I'd like to point out two particular formats which come with both C and Fortran interfaces:

  • ExodusII is netCDF based
  • silo is HDF5 based and can store all sorts of meshes
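
To make the node/element storage concrete, a minimal unstructured mesh of two triangles could be represented like this (purely illustrative):

import numpy

# node coordinates: one (x, y) pair per node
nodes = numpy.array([[0.,0.], [1.,0.], [1.,1.], [0.,1.]])
# elements: each row lists the node IDs of one triangle
elements = numpy.array([[0,1,2], [0,2,3]])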

Multi-dimensional Complex Data

Finally, there are data sets with a very complex, irregular structure. These data are best stored in a relational database, typically programmed using SQL (structured query language). There are many databases available, ranging from single file DBs such as SQLite, via client/server based DBs such as MySQL and PostgreSQL, to serious, expensive ones, e.g. Oracle. If you restrict yourself to standard SQL it should be relatively straightforward to port your software to a different database. For small data sets I quite like to use SQLite, which stores data in a single platform independent file (note: there are potential file locking issues) and has bindings to many languages.
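
For instance, from Python the built-in sqlite3 module is all you need; a minimal sketch (the table and file names are just for illustration):

import sqlite3

conn = sqlite3.connect("measurements.db") # a single, portable file
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS obs (site TEXT, year INTEGER, thk REAL)")
cur.execute("INSERT INTO obs VALUES (?, ?, ?)", ("site_a", 2009, 3200.0))
conn.commit()
for row in cur.execute("SELECT * FROM obs WHERE thk > 1000"):
   print row
conn.close()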

A Fancy Plotting Script

The following script produces a nice image from a netCDF file of Greenland balance velocities.

import pycdf
import numpy as np
from pylab import clf
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
import matplotlib.colors as colors
 
# open the input netCDF files (read-only)
nc = pycdf.CDF("5kmGreen_balvel.nc", pycdf.NC.NOWRITE)
nc.automode()
 
nc2 = pycdf.CDF("Greenland_5km_v7.nc", pycdf.NC.NOWRITE)
nc2.automode()
 
# read the lat/lon grid and the balance velocities
lons  = nc2.var('lon')[0,:]
lats  = nc2.var('lat')[0,:]
balvel  = nc.var('Balance_velocity')[1,:]
 
# clamp small velocities so the logarithmic colour scale works
balvel[balvel<0.5] = 0.5
 
clf()
# create Basemap instance for Robinson projection.
m = Basemap(width=1500000,height=2800000,\
            resolution='l',projection='stere',\
            lat_ts=71,lat_0=72,lon_0=-42.)
 
# compute map projection coordinates for the lat/lon grid.
x,y = m(*(lons,lats))
 
breakPoints = np.array([.5-1.e-9 ,4.,8.,32.,64.,128.,256.,512.,1024.,2048.])
 
balvel[~np.isfinite(balvel)]=0.
# make a filled contour plot with a logarithmic colour scale
cs = m.contourf(x,y,balvel,breakPoints,cmap=plt.cm.gist_ncar,\
               norm=colors.LogNorm(),alpha=0.7,antialiased=False)
 
#cs = m.pcolor(x,y,balvel,norm=colors.LogNorm())
 
cb = plt.colorbar(cs,ticks=breakPoints,orientation='vertical')     
cb.ax.set_yticklabels(['0','4','8','32','64','128','256','512','1024','2048'])    
cb.set_label('metres per annum')
 
m.bluemarble()
m.drawcoastlines() # draw coastlines
m.drawmapboundary() 
 
m.drawparallels(np.arange(60.,80.,5.),labels=[1,0,0,0]) # draw parallels
m.drawmeridians(np.arange(300.,350.,10.),labels=[0,0,0,1]) # draw meridians
 
plt.title('Greenland Balance Velocities (5 ice thickness averaging)') # add a title
plt.show()
 
nc.close()
nc2.close()