Comparing the Pandas DataFrame and Astropy Tables

version 1.0

Wing-Fai Thi 21/05/2018

Pandas is a very powerful data manupulation package for Python and it is used by quasi all the data scientists. Astronomers have developed more dedicated packages to deal with large astronomical datasets. One of the most mature package is Astropy. Astropy has many tools to manage data (images and tables), to read and write them in the formats that are commonly used in astronomy. The equivalent in astronomy of the HDF5 format to save large amount of binary data is the fits format. FITS (Flexible Image Transport System) is the data format most widely used within astronomy for transporting, analyzing, and archiving scientific data files. The format is endorsed by both NASA and the International Astronomical Union (IAU). For more information about the FITS standard, go to the FITS Web site, http://fits.gsfc.nasa.gov/ and https://www.fileformat.info/format/fits/egff.htm.

Astropy includes powerful tools to read, manage, and write images and tables into fits file.

The fits format supports for unlimited metadata in the header, for example the sky coordinates, information about the telescope, etc in addition to the keys for tables.

In this notebook we will explore a couple of methods provided by Pandas and Astropy to deal with tabular data. It is a work in progress and is not an exhaustive overview of the methods in both packages.

This notebook requires Python 3.6 an Astropy version 3.0.1 (Astropy version >3 is only available for Python > 3.4).

Reading a csv file

Although Pandas and Astropy can deal with specific format not supported by the other packages (Excel, json format for Pandas and fits format for Astropy), they both accept other common file formats such as HDF5, ascii and cvs formats. In the first part of the notebook, we take as example the Titanic passenger manifest, whose data are in the CVS format. Comma Separated Value files (CSV) are widely used (and an export and import option for programs like Excel). Python has a standard library that can read csv format:

In [1]:
from __future__ import division, absolute_import, print_function

# Method 1: Standard Python
from csv import reader
def load_csv(filename):
    dataset = list() # we put the dataset in a list
    with open(filename,'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

The titanic dataset can be found at various places.

In [2]:
dataset=load_csv('titanic.csv')
dataset[0:2][:]
Out[2]:
[['PassengerId',
  'Survived',
  'Pclass',
  'Name',
  'Sex',
  'Age',
  'SibSp',
  'Parch',
  'Ticket',
  'Fare',
  'Cabin',
  'Embarked'],
 ['1',
  '0',
  '3',
  'Braund, Mr. Owen Harris',
  'male',
  '22',
  '1',
  '0',
  'A/5 21171',
  '7.25',
  '',
  'S']]
In [3]:
# Method 2: Pandas
import pandas as pd
In [4]:
df = pd.read_csv('titanic.csv')
In [5]:
df.head(5)
Out[5]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [6]:
df.tail(5)
Out[6]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.00 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.00 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.45 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q
In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
In [8]:
df.describe()
Out[8]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
In [9]:
# Method 3: astropy.table
from astropy.table import Table, unique
In [10]:
tab = Table.read('titanic.csv')
In [11]:
tab.info
Out[11]:
<Table masked=True length=891>
    name     dtype  n_bad
----------- ------- -----
PassengerId   int64     0
   Survived   int64     0
     Pclass   int64     0
       Name   str82     0
        Sex    str6     0
        Age float64   177
      SibSp   int64     0
      Parch   int64     0
     Ticket   str18     0
       Fare float64     0
      Cabin   str15   687
   Embarked    str1     2
In [12]:
tab
Out[12]:
Table masked=True length=891
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
int64int64int64str82str6float64int64int64str18float64str15str1
103Braund, Mr. Owen Harrismale22.010A/5 211717.25--S
211Cumings, Mrs. John Bradley (Florence Briggs Thayer)female38.010PC 1759971.2833C85C
313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.925--S
411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1C123S
503Allen, Mr. William Henrymale35.0003734508.05--S
603Moran, Mr. Jamesmale--003308778.4583--Q
701McCarthy, Mr. Timothy Jmale54.0001746351.8625E46S
803Palsson, Master. Gosta Leonardmale2.03134990921.075--S
913Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)female27.00234774211.1333--S
1012Nasser, Mrs. Nicholas (Adele Achem)female14.01023773630.0708--C
....................................
88203Markun, Mr. Johannmale33.0003492577.8958--S
88303Dahlberg, Miss. Gerda Ulrikafemale22.000755210.5167--S
88402Banfield, Mr. Frederick Jamesmale28.000C.A./SOTON 3406810.5--S
88503Sutehall, Mr. Henry Jrmale25.000SOTON/OQ 3920767.05--S
88603Rice, Mrs. William (Margaret Norton)female39.00538265229.125--Q
88702Montvila, Rev. Juozasmale27.00021153613.0--S
88811Graham, Miss. Margaret Edithfemale19.00011205330.0B42S
88903Johnston, Miss. Catherine Helen "Carrie"""female--12W./C. 660723.45--S
89011Behr, Mr. Karl Howellmale26.00011136930.0C148C
89103Dooley, Mr. Patrickmale32.0003703767.75--Q

177 entries for age are missing

Standard Python

The data are read in a list. Further parsing is required.

Pandas

The missing values are automatically replaced by NaN. Pandas object-type means string.

Astropy

The missing values are not automatically assigned NaN.

Astropy table quantities and units

A column can have specific units for quantitative data.

In [13]:
from astropy import units as u
u.s.find_equivalent_units() # find time units in addition to seconds (s)
Out[13]:
  Primary name | Unit definition | Aliases 
[
  a            | 3.15576e+07 s   | annum    ,
  d            | 86400 s         | day      ,
  fortnight    | 1.2096e+06 s    |          ,
  h            | 3600 s          | hour, hr ,
  min          | 60 s            | minute   ,
  s            | irreducible     | second   ,
  sday         | 86164.1 s       |          ,
  wk           | 604800 s        | week     ,
  yr           | 3.15576e+07 s   | year     ,
]
In [14]:
tab['Age'].unit='year'
tab
Out[14]:
Table masked=True length=891
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
yr
int64int64int64str82str6float64int64int64str18float64str15str1
103Braund, Mr. Owen Harrismale22.010A/5 211717.25--S
211Cumings, Mrs. John Bradley (Florence Briggs Thayer)female38.010PC 1759971.2833C85C
313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.925--S
411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1C123S
503Allen, Mr. William Henrymale35.0003734508.05--S
603Moran, Mr. Jamesmale--003308778.4583--Q
701McCarthy, Mr. Timothy Jmale54.0001746351.8625E46S
803Palsson, Master. Gosta Leonardmale2.03134990921.075--S
913Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)female27.00234774211.1333--S
....................................
88203Markun, Mr. Johannmale33.0003492577.8958--S
88303Dahlberg, Miss. Gerda Ulrikafemale22.000755210.5167--S
88402Banfield, Mr. Frederick Jamesmale28.000C.A./SOTON 3406810.5--S
88503Sutehall, Mr. Henry Jrmale25.000SOTON/OQ 3920767.05--S
88603Rice, Mrs. William (Margaret Norton)female39.00538265229.125--Q
88702Montvila, Rev. Juozasmale27.00021153613.0--S
88811Graham, Miss. Margaret Edithfemale19.00011205330.0B42S
88903Johnston, Miss. Catherine Helen "Carrie"""female--12W./C. 660723.45--S
89011Behr, Mr. Karl Howellmale26.00011136930.0C148C
89103Dooley, Mr. Patrickmale32.0003703767.75--Q

One can easy list the ages in days.

In [15]:
tab['Age'].to(u.day)
Out[15]:
$[8035.5,~13879.5,~9496.5,~\dots,~0,~9496.5,~11688] \; \mathrm{d}$
In [16]:
tab.info('stats')
<Table masked=True length=891>
    name            mean                std         min    max    n_bad
----------- ------------------- ------------------- ---- -------- -----
PassengerId               446.0  257.20938292890224    1      891     0
   Survived  0.3838383838383838 0.48631931786709987    0        1     0
     Pclass   2.308641975308642  0.8356019334795166    1        3     0
       Name                  --                  --   --       --     0
        Sex                  --                  --   --       --     0
        Age   29.69911764705882  14.516321150817316 0.42     80.0   177
      SibSp  0.5230078563411896  1.1021244350892878    0        8     0
      Parch 0.38159371492704824  0.8056047612452208    0        6     0
     Ticket                  --                  --   --       --     0
       Fare  32.204207968574636    49.6655344447741  0.0 512.3292     0
      Cabin                  --                  --   --       --   687
   Embarked                  --                  --   --       --     2

Cross-tab usign Pandas

In [17]:
pclass_xt = pd.crosstab(df['Pclass'], df['Survived'])
pclass_xt
Out[17]:
Survived 0 1
Pclass
1 80 136
2 97 87
3 372 119
In [18]:
# Pandas
sexes = sorted(df['Sex'].unique())
genders_mapping = dict(zip(sexes, range(0, len(sexes) + 1)))
genders_mapping
Out[18]:
{'female': 0, 'male': 1}
In [19]:
# Astropy
sexes = unique(tab,keys='Sex')['Sex']
genders_mapping = dict(zip(sexes, range(0, len(sexes) + 1)))
genders_mapping
Out[19]:
{'female': 0, 'male': 1}
In [20]:
# Get the unique values of Pclass using Pandas
passenger_classes = sorted(df['Pclass'].unique())

for p_class in passenger_classes:
    print('M: ', p_class, len(df[(df['Sex'] == 'male') & (df['Pclass'] == p_class)]))
    print('F: ', p_class, len(df[(df['Sex'] == 'female') & (df['Pclass'] == p_class)]))
M:  1 122
F:  1 94
M:  2 108
F:  2 76
M:  3 347
F:  3 144
In [21]:
# Get the unique values of Pclass using astropy.table
passenger_classes = unique(tab,keys='Pclass')['Pclass']
for p_class in passenger_classes:
    print('M: ', p_class, len(tab[(tab['Sex'] == 'male') & (tab['Pclass'] == p_class)]))
    print('F: ', p_class, len(tab[(tab['Sex'] == 'female') & (tab['Pclass'] == p_class)]))
M:  1 122
F:  1 94
M:  2 108
F:  2 76
M:  3 347
F:  3 144

Easy conversion to a Pandas dataframe

In [22]:
df2 = tab.to_pandas()
In [23]:
df2.head()
Out[23]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1.0 0.0 3.0 Braund, Mr. Owen Harris male 22.0 1.0 0.0 A/5 21171 7.2500 NaN S
1 2.0 1.0 1.0 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1.0 0.0 PC 17599 71.2833 C85 C
2 3.0 1.0 3.0 Heikkinen, Miss. Laina female 26.0 0.0 0.0 STON/O2. 3101282 7.9250 NaN S
3 4.0 1.0 1.0 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1.0 0.0 113803 53.1000 C123 S
4 5.0 0.0 3.0 Allen, Mr. William Henry male 35.0 0.0 0.0 373450 8.0500 NaN S
In [24]:
df2.shape
Out[24]:
(891, 12)
In [25]:
df2.columns
Out[25]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
In [26]:
tab.columns
Out[26]:
<TableColumns names=('PassengerId','Survived','Pclass','Name','Sex','Age','SibSp','Parch','Ticket','Fare','Cabin','Embarked')>
In [27]:
#pd2 = pd.read_excel('Weather.xlsx ')
In [28]:
# SQLite
from sqlalchemy import create_engine
import numpy as np

Pandas/Astropy

Merging

  • concatenate columns/rows: pd.concat()/ hstack([t1,t2]) and vstack([t1,t2])
  • database-style join operation by columns or indexes Dataframes/tables: pd.merge(df1,df2,on='key') / join(left, right[, keys, join_type, …])
  • append a row at the end: pd.append(new_rows ingore_index=True)
In [29]:
df2 = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
s = df2.iloc[3]
df2.append(s, ignore_index=True)
Out[29]:
A B C D
0 -0.633762 -0.623436 -1.010388 -0.090806
1 1.225558 -0.193113 1.887596 0.014025
2 0.896680 0.385825 -0.911641 -0.664368
3 -1.927006 -0.427448 0.470260 -1.116965
4 -0.729664 0.099900 0.338580 -2.023594
5 0.599496 -0.053077 0.839813 -1.504623
6 0.765692 -1.394492 0.939961 -0.908918
7 -0.814391 1.159087 1.214281 1.304610
8 -1.927006 -0.427448 0.470260 -1.116965

Now using Astropy

In [30]:
data_rows = [(1, 2.0, 'x'),
             (4, 5.0, 'y'),
             (5, 8.2, 'z')]
t = Table(rows=data_rows, names=('a', 'b', 'c'), meta={'name': 'first table'},
          dtype=('i4', 'f8', 'S1'))
t
Out[30]:
Table length=3
abc
int32float64bytes1
12.0x
45.0y
58.2z
In [31]:
newrows = Table(rows=[(8, 4.0, 'l'),(-3, 9.0, 'm')], names=('a', 'b', 'c'))
newrows
Out[31]:
Table length=2
abc
int64float64str1
84.0l
-39.0m
In [32]:
from astropy.table import hstack,vstack
vstack([t,newrows]) 
Out[32]:
Table length=5
abc
int64float64str1
12.0x
45.0y
58.2z
84.0l
-39.0m
In [33]:
newcols = Table(rows=[('a2','a3'),(1e4,3.2e3),(-4.,-5.)], names=('d', 'e'))
newcols
Out[33]:
Table length=3
de
str7str6
a2a3
10000.03200.0
-4.0-5.0
In [34]:
hstack([t,newcols]) 
Out[34]:
Table length=3
abcde
int32float64bytes1str7str6
12.0xa2a3
45.0y10000.03200.0
58.2z-4.0-5.0

Pivot table

Data is often stored in CSV files or databases in so-called “stacked” or “record” format.

In [35]:
import pandas.util.testing as tm; tm.N = 3
def unpivot(frame):
    N, K = frame.shape
    data = {'value' : frame.values.ravel('F'),
            'variable' : np.asarray(frame.columns).repeat(N),
            'date' : np.tile(np.asarray(frame.index), K)}
    return pd.DataFrame(data, columns=['date', 'variable', 'value'])
unpivotdf = unpivot(tm.makeTimeDataFrame())
In [36]:
unpivotdf
Out[36]:
date variable value
0 2000-01-03 A -0.786930
1 2000-01-04 A 0.511586
2 2000-01-05 A 0.519832
3 2000-01-03 B -0.378682
4 2000-01-04 B 0.210597
5 2000-01-05 B -0.938252
6 2000-01-03 C 0.192337
7 2000-01-04 C 2.081758
8 2000-01-05 C -0.505425
9 2000-01-03 D -1.961333
10 2000-01-04 D 0.241400
11 2000-01-05 D 0.111863
In [37]:
pivoteddf=unpivotdf.pivot(index='date', columns='variable', values='value')
pivoteddf
Out[37]:
variable A B C D
date
2000-01-03 -0.786930 -0.378682 0.192337 -1.961333
2000-01-04 0.511586 0.210597 2.081758 0.241400
2000-01-05 0.519832 -0.938252 -0.505425 0.111863
In [38]:
pivotedtab=Table.from_pandas(pivoteddf)
In [39]:
pivotedtab
Out[39]:
Table length=3
ABCD
float64float64float64float64
-0.7869295401722648-0.378681535566173950.19233656067833904-1.961332809973836
0.51158558171809470.210597136959695922.0817580440301840.24139957830778125
0.5198322824158546-0.9382517067515813-0.505425270213230.11186257441218021

The index column is missing! It is not passed as column to an astropy table. One has to create add a column using the index values.

In [40]:
# create a dataframe out of the index
df2=pd.DataFrame(pivoteddf.index,index=pivoteddf.index)
In [41]:
# add it to the initial dataframe
pivoteddf2=pd.concat([df2,pivoteddf],axis='columns')
In [42]:
pivotedtab2=Table.from_pandas(pivoteddf2)
In [43]:
pivotedtab2
Out[43]:
Table length=3
dateABCD
datetime64[ns]float64float64float64float64
2000-01-03T00:00:00.000000000-0.7869295401722648-0.378681535566173950.19233656067833904-1.961332809973836
2000-01-04T00:00:00.0000000000.51158558171809470.210597136959695922.0817580440301840.24139957830778125
2000-01-05T00:00:00.0000000000.5198322824158546-0.9382517067515813-0.505425270213230.11186257441218021

Another feature that is absent from Astropy table is a method to generate cross-tables. Generate a cross tab of Pclass and Survived:

In [44]:
pclass_xt = pd.crosstab(df['Pclass'], df['Survived'])
pclass_xt
Out[44]:
Survived 0 1
Pclass
1 80 136
2 97 87
3 372 119

The cross-table is a contingency/frequency table. Subsequent analysis via the Fisher or chi2 tests can be performed on a contingency/frequency table.

In [45]:
from scipy.stats.contingency import margins,chi2_contingency
m0, m1 = margins(pclass_xt)
In [46]:
chi2,p_chi2,dof,expected = chi2_contingency(pclass_xt)
In [47]:
# expected number
dfexp=pd.DataFrame(np.round(expected),index=[pclass_xt.index])
dfexp
Out[47]:
0 1
Pclass
1 133.0 83.0
2 113.0 71.0
3 303.0 188.0
In [48]:
deviation=pd.DataFrame(np.array(pclass_xt)/np.array(dfexp),index=[pclass_xt.index])
deviation
Out[48]:
0 1
Pclass
1 0.601504 1.638554
2 0.858407 1.225352
3 1.227723 0.632979
In [49]:
# Testing HDF5 versus fits
df.to_hdf('titanic_train.hdf5','table')
/Users/Wing-Fai/Projects/python3/env/lib/python3.6/site-packages/pandas/core/generic.py:1471: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block2_values] [items->['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']]

  return pytables.to_hdf(path_or_buf, key, self, **kwargs)
In [50]:
df3 = pd.read_hdf('titanic_train.hdf5','table')
df3.head(2)
Out[50]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
In [51]:
import h5py
x = h5py.File('titanic_train.hdf5','r')
keys = [t for t in x.keys()]
print(keys)
['table']
/Users/Wing-Fai/Projects/python3/env/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
In [52]:
list_of_names = []
x.visit(list_of_names.append)
list_of_names
Out[52]:
['table',
 'table/axis0',
 'table/axis1',
 'table/block0_items',
 'table/block0_values',
 'table/block1_items',
 'table/block1_values',
 'table/block2_items',
 'table/block2_values']
In [53]:
x = h5py.File('H2.hdf5','r')
keys = [t for t in x.keys()]
print(keys)
['README', 'photoabsorption', 'photodissociation', 'photoionisation', 'wavelength']
In [54]:
## load the README text
README = x['README'].value
print( README)
## load data as a numpy array
wavelength = x['wavelength'].value
photodissociation = x['photodissociation'].value
H2 cross sections
From the Leiden database for photodissociation and photoionization of astrophysically relevant molecules, http://home.strw.leidenuniv.nl/~ewine/photo/
File last modified by Alan Heays 2017-03-01
references: abgrall_etal2000, chan1992 10.1016/0301-0104(92)87171-5, backx1976 http://stacks.iop.org/0022-3700/9/i=2/a=018,
In [55]:
#help("h5py.File")

Write/Read HDF5 into astropy table

I have failed so far to write the Astropy table into a HDF5 file.

In [56]:
#tab.write('titanic_train_astropy.hdf5',format='hdf5',path='table')

Saving a table in excel format using Pandas

In [57]:
df.to_excel('titanic_train.xls')

Write fits with astropy table

In [58]:
tab.write('titanic_train.fits', format='fits',overwrite=True)
In [59]:
tab2=Table.read('titanic_train.fits', format='fits')
In [60]:
tab2[0:2]
Out[60]:
Table masked=True length=2
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
yr
int64int64int64bytes82bytes6float64int64int64bytes18float64bytes15bytes1
103Braund, Mr. Owen Harrismale22.010A/5 211717.25N/AS
211Cumings, Mrs. John Bradley (Florence Briggs Thayer)female38.010PC 1759971.2833C85C

Comparison hdf5/fits/csv

  • the hdf5 file written by Pandas has a size of 1.1M
  • the fits file written by Astropy is 166K
  • the original csv files is 59K

Summary

  • Astropy handles fits input/output of tables but may lack data manipulation methods present in Pandas
  • Astropy does not deal with HDF5 file format well.
  • CSV files are read and write without issues by both packages
  • My recommendation would be to use fits format as it is de facto the standard format in astronomy and for non-astronomers to try this well maintained robust format for binary images and tables. The transformation between Astropy table and pandas Dataframes is easy and one should take adavantage of the strength of each package, which complement each other pretty well.