Reading AERONET data in Pandas: a simple helper function

I use data from the AERONET network of sun photometers a lot in my work, and do a lot of processing of the data in Python. As part of this I usually want to load the data into pandas – but because of the format of the data, it’s not quite as simple as it could be.

So, for those of you who are impatient, here is some code that reads an AERONET data file into a pandas DataFrame which you can just download and use:


400: Invalid request

For those who want more details, read on…

Once you’ve downloaded an AERONET data file and unzipped it, you’ll find you have a file called something like 050101_161231_Chilbolton.lev20, and if you look at the start of the file it’ll look a bit like this:

Level 2.0. Quality Assured Data.<p>The following data are pre and post field calibrated, automatically cloud cleared and manually inspected.
Version 2 Direct Sun Algorithm
Location=Chilbolton,long=-1.437,lat=51.144,elev=88,Nmeas=13,PI=Iain_H._Woodhouse_and_Judith_Agnew_and_Judith_Jeffrey_and_Judith_Jeffery,Email=fsf@nerc.ac.uk_and__and__and_judith.jeffery@stfc.ac.uk
AOD Level 2.0,All Points,UNITS can be found at,,, http://aeronet.gsfc.nasa.gov/data_menu.html
Date(dd-mm-yy),Time(hh:mm:ss),Julian_Day,AOT_1640,AOT_1020,AOT_870,AOT_675,AOT_667,AOT_555,AOT_551,AOT_532,AOT_531,AOT_500,AOT_490,AOT_443,AOT_440,AOT_412,AOT_380,AOT_340,Water(cm),%TripletVar_1640,%TripletVar_1020,%TripletVar_870,%TripletVar_675,%TripletVar_667,%TripletVar_555,%TripletVar_551,%TripletVar_532,%TripletVar_531,%TripletVar_500,%TripletVar_490,%TripletVar_443,%TripletVar_440,%TripletVar_412,%TripletVar_380,%TripletVar_340,%WaterError,440-870Angstrom,380-500Angstrom,440-675Angstrom,500-870Angstrom,340-440Angstrom,440-675Angstrom(Polar),Last_Processing_Date(dd/mm/yyyy),Solar_Zenith_Angle
10:10:2005,12:38:46,283.526921,N/A,0.079535,0.090636,0.143492,N/A,N/A,N/A,N/A,N/A,0.246959,N/A,N/A,0.301443,N/A,0.373063,0.430350,2.115728,N/A,0.043632,0.049923,0.089966,N/A,N/A,N/A,N/A,N/A,0.116690,N/A,N/A,0.196419,N/A,0.181772,0.532137,N/A,1.776185,1.495202,1.757222,1.808187,1.368259,N/A,17/10/2006,58.758553

You can see here that we have a few lines of metadata at the top of the file, including the ‘level’ of the data (AERONET data is provided at three levels, 1.0, 1.5 and 2.0, referring to the quality assurance of the data), and some information about the AERONET site.

In this function we’re just going to ignore this metadata, and start reading at the 5th line, which contains the column headers. Now, you’ll see that the data looks like a fairly standard CSV file, so we should be able to read it fairly easily with pd.read_csv. This is true, and you can read it using:

df = pd.read_csv(filename, skiprows=4)

However, you’ll find a few issues with the DataFrame you get back from that simple line of code: firstly dates and times are just left as strings (rather than being parsed into proper datetime columns) and missing data is still shown as the string ‘N/A’. We can solve both of these:

No data: read_csv allows us to specify how ‘no data’ values are represented in the data, so all we need to do is set this: pd.read_csv(filename, skiprows=4, na_values=['N/A']) Note: we need to give na_values a list of values to treat as no data, hence we create a single-element list containing the string N/A.

Dates & times: These are a little harder, mainly because of the strange format in which they are provided in the file. Although the column header for the first column says Date(dd-mm-yy), the date is actually colon-separated (dd:mm:yy). This is a very unusual format for a date, so pandas won’t automatically convert it – we have to help it along a bit. So, first we define a function to parse a date from that strange format into a standard Python datetime:

dateparse = lambda x: pd.datetime.strptime(x, "%d:%m:%Y %H:%M:%S")

I could have written this as a normal function (def dateparse(x)), but I used a lambda expression as it seemed easier for such a short function. Once we’ve defined this function we tell pandas to use it to parse dates (date_parser=dateparse) and also tell it that the first two columns together represent the time of each observation, and they should be parsed as dates (parse_dates={'times':[0,1]}).

Putting all of this together, we get:

dateparse = lambda x: pd.datetime.strptime(x, "%d:%m:%Y %H:%M:%S")
aeronet = pd.read_csv(filename, skiprows=4, na_values=['N/A'],
                      parse_dates={'times':[0,1]},
                      date_parser=dateparse)

That’s all we need to do to read in the data and convert the right columns, the rest of the function just does some cleaning up:

We set the times as the index of the DataFrame, as it is the unique identifier for each observation – and makes it easy to join with other data later.
We remove the JulianDay column, as it’s rather useless now that we have a properly parsed timestamp
We drop any columns that are entirely NaN and any rows that are entirely NaN (that’s what dropna(axis=1, how='all') does).
We rename a column, and then make sure the data is sorted

aeronet = aeronet.set_index('times')
del aeronet['Julian_Day']

# Drop any rows that are all NaN and any cols that are all NaN
# & then sort by the index
an = (aeronet.dropna(axis=1, how='all')
            .dropna(axis=0, how='all')
            .rename(columns={'Last_Processing_Date(dd/mm/yyyy)': 'Last_Processing_Date'})
            .sort_index())

You’ll notice that the last few bits of this ‘post-processing’ were done using ‘method-chaining’, where we just ‘chain’ pandas methods one after another. This is often a very convenient way to work in Python – see this blog post for more information.

So, that’s how this function works – now go off and process some AERONET data!