Data Science and Computing with Python for Pilots and Flight Test Engineers
Reading in Data from a File
This lesson focuses on how to read data from a file into our Python program, so that we can analyze and display it. It is assumed in this lesson that the data is present in tabular form (i.e. as a table). We will cover how to read in images and videos in a different lesson, when we start discussing computer vision.
There are many ways to do this. We will use DataFrames from the Pandas library here, because reading data into them is super simple, and they will prove convenient for data preprocessing afterwards as well. Pandas is an excellent tool for data manipulation.
But first, we need to create ourselves a plain text data file with some example data for us to read in…
The Data File
Many data file formats exist. We will introduce only one here, the comma-separated values (CSV) format. A CSV file is a plain text file that stores tabular data, with the individual entries separated by commas.
This format is convenient, because we can open and edit it with any text editor (such as the application TextEdit on a Mac). It is also very popular and widespread. For instance, if you have a Microsoft Excel spreadsheet, Excel gives you the option to export it as a plain text file with the data in CSV format.
The idea behind the CSV format is very simple. Each line in a file corresponds to a row in a table. The entries in each row (i.e. in the individual table columns) are separated by a comma. If you wish, you can use a different character as a delimiter instead of a comma, but the comma is the default. Note that spaces are also read as column content if the entry is a string, so when you create the file by hand, it is best not to make a space after a comma for the next entry.
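The effect of a space after a comma can be seen in the following minimal sketch. The two tiny CSV texts are made up for this illustration and are held in memory with io.StringIO, so that no file is needed. The skipinitialspace parameter of read_csv() is an optional extra that this lesson does not otherwise use.

```python
import io

import pandas as pd

# Two tiny in-memory CSV texts (made up for this illustration):
# one without and one with a space after each comma.
csv_tight = "Weekday,Comments\nMonday,Late\n"
csv_spaced = "Weekday, Comments\nMonday, Late\n"

df_tight = pd.read_csv(io.StringIO(csv_tight))
df_spaced = pd.read_csv(io.StringIO(csv_spaced))

# The space becomes part of both the column name and the entry.
print(list(df_tight.columns))      # ['Weekday', 'Comments']
print(list(df_spaced.columns))     # ['Weekday', ' Comments']
print(repr(df_spaced.iloc[0, 1]))  # ' Late'

# skipinitialspace=True asks Pandas to drop spaces that follow a delimiter.
df_fixed = pd.read_csv(io.StringIO(csv_spaced), skipinitialspace=True)
print(list(df_fixed.columns))      # ['Weekday', 'Comments']
```

Leaving out the spaces when writing the file by hand avoids the need for this extra option.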
We show an example of a typical CSV data file below. The data is made up, based on what you might experience in your job as a flight instructor. Please copy the text in the code cell below, paste it into the window of a text editor, and save it as a plain text file (not in Microsoft Word format, Rich Text Format (RTF), etc.). Give the file the name “example_data1.csv” and put it in the same directory as the Jupyter notebook you are working with. You can use Microsoft Word to create the file, but when you save it, export it as plain text (i.e. with a .txt or .csv file extension), not as an MS Word document (with a .doc or .docx extension).
Weekday,Date,Flights,Flight Time,Comments
Monday,2024-01-22,3,1.5,Was late for the airport in the morning.
Tuesday,2024-01-23,1,2.8,"Student came unprepared."
Wednesday,,,
Thursday,2024-01-25,5,7.3,"It was windy, but I was able to handle it."
Friday,2024-01-26,2,3.2,Ramp checked by the FAA.
Saturday,2024-01-27,0,,
Sunday,2024-01-28,1,2.3,"Stage check for Borring Persson from Sweden."
Monday,2024-01-29,3,1.5,Was late for the airport again. Monday mornings...
Tuesday,2024-01-30,1,2.8,"Student came unprepared again, we flew in circles."
Wednesday,2024-01-31,5,3.2,"Four intro flights and one spin training flight."
Thursday,2024-02-01,5,7.3,
Friday,2024-02-02,2,1.5,"Student flew so low. Took her out with a coconut."
Saturday,2024-02-03,0,0,Plane lost door. Boeing a pilot makes me nervous.
Sunday,2024-02-04,1,1.3,"Storm Hunter's aerobatic competition prep."
Note a few peculiarities in the CSV file text above. The first line contains the names of the columns, not data. The data starts only on the second line. Commas separate the individual fields, without spaces after the comma. Strings (text entries) can be put in quotation marks, but they do not have to be, with one exception: a string that itself contains a comma. The comment on the first Thursday (line 5) has to be in quotation marks for this reason, and so does the comment on the second Tuesday (line 10). If we did not put such an entry in quotation marks, the comma would be interpreted as a delimiter, indicating the start of a new column, which is not what we want here.
On the content side, note that the data has some gaps in the entries. It may also be more convenient to have the date as the first column, because it orders the rows sequentially. These are all cases for data preprocessing, which has to happen before we can commence analysis. We will learn how to do this in the next lesson.
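The role of the quotation marks can be checked with a small sketch. It uses the csv module from the Python standard library instead of Pandas, simply because it makes counting the fields of a single line easy. The unquoted variant of the line is our own modification for comparison and is not part of the example file.

```python
import csv
import io

# Line 5 of the example file as written, and the same line
# with the quotation marks removed (our own variant, for comparison).
quoted = 'Thursday,2024-01-25,5,7.3,"It was windy, but I was able to handle it."\n'
unquoted = 'Thursday,2024-01-25,5,7.3,It was windy, but I was able to handle it.\n'

fields_quoted = next(csv.reader(io.StringIO(quoted)))
fields_unquoted = next(csv.reader(io.StringIO(unquoted)))

print(len(fields_quoted))    # 5
print(len(fields_unquoted))  # 6: the comment was split at its comma
print(fields_quoted[-1])     # It was windy, but I was able to handle it.
```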
Now we still need to learn how to read in the above file into our program with Python in the first place.
The Python Code
Reading in the file into our Python program and storing its content in a variable can be accomplished by a single line of code, once we have imported the required library (in this case, we decided to use Pandas).
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Create a Pandas DataFrame (called df in the example) and read in the data from a CSV file into it, using the Pandas read_csv() function:
df = pd.read_csv('./example_data1.csv')
To display the data within the Jupyter notebook, simply put the Pandas DataFrame on a line on its own at the very bottom of a code cell:
df
Copy and paste the above code into a cell of your Jupyter notebook and run it. Notice that the columns have been automatically labeled with the names provided in the first line of the file. Usually this is what you want, but we will see below how to change this behavior, if needed.
Pandas reads in numerical values automatically as numbers. Entries containing text will be stored as strings.
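To see which data type Pandas has chosen for each column, you can inspect the dtypes attribute of the DataFrame. The sketch below reads the header line and the first three data lines of the example file from memory, so that it runs even without the file. The names sample and df_sample are chosen for this illustration only.

```python
import io

import pandas as pd

# The header line and the first three data lines of example_data1.csv,
# held in memory so that this sketch runs without the file.
sample = (
    "Weekday,Date,Flights,Flight Time,Comments\n"
    "Monday,2024-01-22,3,1.5,Was late for the airport in the morning.\n"
    'Tuesday,2024-01-23,1,2.8,"Student came unprepared."\n'
    "Wednesday,,,\n"
)

df_sample = pd.read_csv(io.StringIO(sample))

# One data type per column.
print(df_sample.dtypes)

# Number of missing entries per column.
print(df_sample.isna().sum())
```

Note that the Flights column is stored as floating point numbers rather than integers here, because it contains a missing entry. The Date column is read in as text, not as a date.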
Data Read-In Options
Sometimes our CSV file is not formatted exactly the way we want, and we may have to change some of the parameters during read-in. Typical cases are:
- use of a different character as the delimiter (instead of a comma, using spaces or tabs is quite popular in data files),
- non-existent column names in the first line of the file (i.e. the first line of the file is already data),
- or column names in the first line, which we would like to change during read-in.
We cover these cases below.
Changing the Delimiter
You can change the character that denotes the delimiter, if desired. The default is a comma, but you can change it, for instance to a semicolon, with either one of the two equivalent lines below:
df = pd.read_csv('./example_data1.csv', sep=';')
df = pd.read_csv('./example_data1.csv', delimiter=';')
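Since our example file uses commas, the effect of the sep parameter is easier to see with data that really uses another delimiter. The following sketch uses two small made-up texts held in memory, one separated by semicolons and one by tabs. The variable names are chosen for this illustration only.

```python
import io

import pandas as pd

# Made-up in-memory data using a semicolon and a tab as delimiters.
semicolon_text = "Weekday;Flights;Flight Time\nMonday;3;1.5\nTuesday;1;2.8\n"
tab_text = "Weekday\tFlights\tFlight Time\nMonday\t3\t1.5\nTuesday\t1\t2.8\n"

df_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=';')
df_tab = pd.read_csv(io.StringIO(tab_text), sep='\t')

# Both calls produce the same table with three columns.
print(df_semicolon.equals(df_tab))  # True

# With the wrong delimiter, each line ends up in a single column.
df_wrong = pd.read_csv(io.StringIO(semicolon_text))
print(df_wrong.shape)  # (2, 1)
```

A DataFrame with only one column after read-in is therefore a typical sign that the delimiter does not match the file.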
Changing the Header Names
First of all, what if the CSV file has no header names in the first line and starts directly with data entries? In such a case, we do not want the first line of the data file to be misinterpreted as column names (and thus used as labels and excluded from the data body itself in the Pandas DataFrame).
We can tell Pandas that there is no column name line at the top of the CSV file, and that the first line is already data, by reading in the file with:
df = pd.read_csv('./example_data1.csv', header=None)
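Our example file does have column names in its first line, so the difference is best seen with data that lacks them. The sketch below takes two data lines from the example file, without the header line, and reads them from memory once with the default settings and once with header=None. The variable names are chosen for this illustration only.

```python
import io

import pandas as pd

# Two data lines from the example file, without a line of column names.
no_header = (
    "Monday,2024-01-22,3,1.5,Was late for the airport in the morning.\n"
    "Friday,2024-01-26,2,3.2,Ramp checked by the FAA.\n"
)

# Default behavior: the first data line is mistaken for the column names.
df_default = pd.read_csv(io.StringIO(no_header))
print(df_default.shape)  # (1, 5)

# header=None: both lines are kept as data, and the columns are numbered.
df_no_header = pd.read_csv(io.StringIO(no_header), header=None)
print(df_no_header.shape)          # (2, 5)
print(list(df_no_header.columns))  # [0, 1, 2, 3, 4]
```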
If column names are present in the first line of the CSV file, but we do not like them and want to give the data our own column names during read-in, we can do so as follows:
df = pd.read_csv('./example_data1.csv', header=0, names=['Day', 'Date', 'Number of Flights', 'Time', 'Remarks'])
The above reads in the data from the file and renames the Pandas DataFrame columns to Day, Date, Number of Flights, Time, and Remarks, respectively.
Such a renaming during read-in is particularly useful, for instance, if you read in data from different sources, each labeling the columns slightly differently, and if you want to standardize the column names within your Python code, in order to be able to combine the data during analysis.
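As a rough sketch of this idea, assume two sources that label the same two columns differently. Both texts and their column labels are made up for this illustration. Reading each one with the same names list gives identically labeled DataFrames, which can then be stacked with the Pandas concat() function.

```python
import io

import pandas as pd

# Two made-up sources that label the same columns differently.
source_a = "Weekday,Flights\nMonday,3\nTuesday,1\n"
source_b = "Day of Week,No. of Flights\nWednesday,5\nThursday,5\n"

# Our own standardized column names, applied during read-in.
standard_names = ['Day', 'Number of Flights']

df_a = pd.read_csv(io.StringIO(source_a), header=0, names=standard_names)
df_b = pd.read_csv(io.StringIO(source_b), header=0, names=standard_names)

# With identical column names, the two tables stack cleanly.
df_all = pd.concat([df_a, df_b], ignore_index=True)
print(list(df_all.columns))  # ['Day', 'Number of Flights']
print(df_all.shape)          # (4, 2)
```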
Columns in Pandas DataFrames can, of course, also be renamed later, but here we are showing you how to do it directly during read-in, if desired.
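For comparison, a minimal sketch of renaming after read-in could look as follows, using the DataFrame rename() method. The one-line sample is taken from the example file and held in memory; the names sample, df_sample, and df_renamed are chosen for this illustration only.

```python
import io

import pandas as pd

# The header line and one data line of the example file, held in memory.
sample = (
    "Weekday,Date,Flights,Flight Time,Comments\n"
    "Friday,2024-01-26,2,3.2,Ramp checked by the FAA.\n"
)
df_sample = pd.read_csv(io.StringIO(sample))

# Map old column names to new ones; columns not listed keep their name.
df_renamed = df_sample.rename(columns={
    'Weekday': 'Day',
    'Flights': 'Number of Flights',
    'Flight Time': 'Time',
    'Comments': 'Remarks',
})

print(list(df_renamed.columns))
# ['Day', 'Date', 'Number of Flights', 'Time', 'Remarks']
```

The rename() method returns a new DataFrame and leaves df_sample unchanged.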
Of course, reading in data may require different techniques than the ones covered above, but the above is sufficient for a surprisingly large number of cases. We will learn how to read in images and videos with Python later, when we discuss computer vision.