What is meant by ‘tidy’ data?
When processing and plotting data, how you choose your columns can have a massive impact on how easy your data is to manipulate. Data can either be in ‘long’ (or ‘tidy’) form, or it can be in ‘wide’ form. Some plotting libraries are designed to work with ‘long’ data, and others with ‘wide’ data.
Long data
A table stored in ‘long’ form has a single column for each variable in the system.
In the case of UK election results, each data point represents the number of seats a particular party won in a particular year, so our variables are seats, party and year:
>> print(long)
year party seats
0 1966 Conservative 253
1 1970 Conservative 330
2 Feb 1974 Conservative 297
3 Oct 1974 Conservative 277
4 1979 Conservative 339
.. ... ... ...
55 2005 Others 30
56 2010 Others 29
57 2015 Others 80
58 2017 Others 59
59 2019 Others 72
[60 rows x 3 columns]
‘Long-form’ data is also sometimes called ‘tidy’, ‘stacked’ or ‘narrow’.
Libraries that work best with long data: Seaborn, Plotly Express, Altair
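As a minimal sketch of why one column per variable is convenient, the snippet below rebuilds a four-row subset of the table above and filters and groups it by column name. The name long_subset is made up for this example, and storing year as a string is an assumption based on labels such as ‘Feb 1974’ in the printed table.

```python
import pandas as pd

# A small subset of the election data, one row per (year, party) observation.
# 'year' is stored as text (an assumption) because labels like 'Feb 1974' are not plain integers.
long_subset = pd.DataFrame(
    {
        "year": ["1966", "1966", "1970", "1970"],
        "party": ["Conservative", "Labour", "Conservative", "Labour"],
        "seats": [253, 364, 330, 287],
    }
)

# Each variable has its own column, so filtering and grouping refer to column names directly.
labour_rows = long_subset[long_subset["party"] == "Labour"]
seats_by_party = long_subset.groupby("party")["seats"].sum()

print(labour_rows)
print(seats_by_party)
```

Libraries designed for long data tend to take column names in the same way, for example to say which column supplies the x values and which one splits the data into groups.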
Wide data
A table stored in ‘wide’ form spreads a variable across several columns. We have the number of seats won by four parties (including others), so it seems sensible to store them in four columns:
>> print(wide)
year conservative labour liberal others
0 1966 253 364 12 1
1 1970 330 287 6 7
2 Feb 1974 297 301 14 18
.. ... ... ... ... ...
12 2015 330 232 8 80
13 2017 317 262 12 59
14 2019 365 202 11 72
[15 rows x 5 columns]
‘Wide form’ data is also sometimes called ‘un-stacked’.
Libraries that work best with wide data: Matplotlib, Plotly, Bokeh, PyGal, Pandas.
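The sketch below rebuilds the first two rows of the wide table to show what this layout makes easy: each party's results can be read off as a single column, and a per-election total is a sum across the party columns. The names wide_subset and party_columns are illustrative, not part of the original data-loading code.

```python
import pandas as pd

# The first two rows of the wide table: one row per election, one column per party.
wide_subset = pd.DataFrame(
    {
        "year": ["1966", "1970"],
        "conservative": [253, 330],
        "labour": [364, 287],
        "liberal": [12, 6],
        "others": [1, 7],
    }
)

party_columns = ["conservative", "labour", "liberal", "others"]

# One series per party is just a column lookup.
labour_seats = wide_subset["labour"]

# Total seats per election: sum across the party columns of each row.
total_seats = wide_subset[party_columns].sum(axis=1)

print(labour_seats.tolist())
print(total_seats.tolist())
```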
Pandas DataFrames
All the libraries mentioned in our Python plotting guide work well with Pandas DataFrames, so I’ve created DataFrames from my data.
Pandas DataFrames allow you to manipulate large amounts of tabulated data in a scalable way, providing methods to iterate through columns, filter out particular values, replace missing values, and perform many other operations efficiently. Their columns (Pandas Series) also behave much like Python sequences, so they can usually be used anywhere you’d use a Python list.
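To make those operations concrete, here is a small sketch of each one on a three-row table. The frame df is made up for this example, and one Liberal value is deliberately blanked out so that there is a missing value to replace; that gap is not in the real data.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "year": ["1966", "1970", "Feb 1974"],
        "conservative": [253, 330, 297],
        # The 1970 Liberal value (6 in the table above) is deliberately blanked out
        # here, purely to demonstrate how a missing value can be replaced.
        "liberal": [12, None, 14],
    }
)

# Iterate through columns: items() yields (column name, column) pairs.
column_names = [name for name, column in df.items()]

# Filter rows with a boolean mask.
big_wins = df[df["conservative"] > 300]

# Replace missing values, column by column.
filled = df.fillna({"liberal": 0})

# A column can be passed to code that expects a list-like object.
conservative_list = list(df["conservative"])

print(column_names)
print(big_wins)
print(filled)
print(conservative_list)
```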
Converting between Long and Wide data in Pandas
Pandas has convenient methods to convert wide-form data into long form and vice versa.
Wide to long
To convert wide form into long form, use df.melt():
>> # Convert wide form to long form
>> melted = wide.melt('year', var_name='party', value_name='seats')
>> print(melted)
year party seats
0 1966 conservative 253
1 1970 conservative 330
2 Feb 1974 conservative 297
3 Oct 1974 conservative 277
4 1979 conservative 339
.. ... ... ...
55 2005 others 43
56 2010 others 25
57 2015 others 80
58 2017 others 59
59 2019 others 72
For more detail, see the Pandas documentation on melt.
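For readers who want to run the conversion without the full dataset, the following self-contained sketch applies the same melt call to a two-election, two-party subset. The name wide_subset is illustrative; id_vars is spelled out as a keyword here, and it is the same argument that the call above passes positionally as 'year'.

```python
import pandas as pd

# A cut-down wide table: one row per election, one column per party.
wide_subset = pd.DataFrame(
    {
        "year": ["1966", "1970"],
        "conservative": [253, 330],
        "labour": [364, 287],
    }
)

# Keep 'year' as an identifier column. Every other column name becomes a value
# in 'party', and the corresponding cell becomes a value in 'seats'.
melted_subset = wide_subset.melt(id_vars="year", var_name="party", value_name="seats")

print(melted_subset)
```

The result has one row per year and party combination, so two elections and two parties give four rows.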
Long to wide
To convert long form into wide form, use df.pivot().reset_index():
>> # Convert long form to wide form
>> widened = long.pivot(index='year', columns='party', values='seats').reset_index()
>> print(widened)
party year Conservative Labour Liberal Others
0 1966 253 364 12 1
1 1970 330 287 6 7
2 1979 339 269 11 16
.. ... ... ... ... ...
12 2019 365 202 11 72
13 Feb 1974 297 301 14 18
14 Oct 1974 277 313 13 32
For more detail, see the Pandas documentation on pivot.
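Two details of the output above are worth noting: pivot sorts the year labels as text, so ‘Feb 1974’ and ‘Oct 1974’ end up after ‘2019’, and the column axis keeps the name party. The sketch below shows one possible way to handle both, assuming the long table lists elections in chronological order. The names long_subset and election_order are illustrative, and the reindex and rename_axis steps are additions that are not part of the conversion shown above.

```python
import pandas as pd

# A cut-down long table covering four elections and two parties.
long_subset = pd.DataFrame(
    {
        "year": ["1970", "Feb 1974", "Oct 1974", "1979"] * 2,
        "party": ["Conservative"] * 4 + ["Labour"] * 4,
        "seats": [330, 297, 277, 339, 287, 301, 313, 269],
    }
)

# Remember the order in which the elections first appear (assumed chronological).
election_order = long_subset["year"].drop_duplicates()

widened_subset = (
    long_subset.pivot(index="year", columns="party", values="seats")
    .reindex(election_order)                   # undo the alphabetical sort of the year labels
    .rename_axis(index="year", columns=None)   # keep the index name, drop the 'party' axis label
    .reset_index()
)

print(widened_subset)
```

If the alphabetical order and the party label are acceptable, the shorter pivot().reset_index() form is enough.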
More on plotting in Python
For a full comparison of Python plotting libraries, see Plotting in Python: A Rundown of Libraries.