Python : CSV data analysis
In this post we will be doing some simple data analysis from CSV files using core-python API’s. We will not be using any data libraries. This will give us a chance to excercise some of the core python features like sorting, groupby and comprehensions. This post is an extension to my previous post generating dictionary from CSV.
What is the data we are analysing ?
Data we are dealing with are bunch of small csv files, which contains some stock portfolios as shown below. It contains
name, date of purchase, number of shares purchased and price
at the time of purchase. A sample file looks like this.
**Name,Date,Shares,Price**
HPQ,7/11/2007,100,32.2
IBM,7/12/2007,50,91.9
GE,7/13/2007,150,83.44
CAT,7/14/2007,200,51.23
MSFT,7/15/2007,95,40.37
HPE,7/16/2007,50,65.1
AFL,7/17/2007,100,70.44

There will be multiple files residing in a folder (‘stocks’) in the same format as shown above, so we have to read all files and
process it. As a first step, let us read all csv files and generate a list of dictionary from it. We will be using the csvreader
module we created before.
def get_portfolio_as_dict(path):
pattern = '*.csv'
files = glob.glob(os.path.join(path,pattern))
portfolios = []
for file in files:
for row in csvreader.read_csv(file,[str, str, int, float]):
portfolios.append(row)
return portfolios
get_portfolio_as_dict(path)
expects a folder location where all ‘csv’ files resides. We are using glob
module
which returns a list of files that matches the pattern as shown in line:3
. After that we read’s each csv
file and convert each row in that file to a dictionary and appends it to a list and returns the record. A snap shot
of output looks like this .
records = get_portfolio_as_dict('stocks')
pprint.pprint(records)
# output
[{'Date': '7/11/2007', 'Name': 'HPQ', 'Price': 32.2, 'Shares': 100},
{'Date': '7/12/2007', 'Name': 'IBM', 'Price': 91.9, 'Shares': 50},
{'Date': '7/13/2007', 'Name': 'GE', 'Price': 83.44, 'Shares': 150},
{'Date': '7/14/2007', 'Name': 'CAT', 'Price': 51.23, 'Shares': 200},
{'Date': '7/15/2007', 'Name': 'MSFT', 'Price': 40.37, 'Shares': 95},
{'Date': '7/16/2007', 'Name': 'HPE', 'Price': 65.1, 'Shares': 50},
{'Date': '7/17/2007', 'Name': 'AFL', 'Price': 70.44, 'Shares': 100},
{'Date': '7/11/2007', 'Name': 'GOOGL', 'Price': 55.2, 'Shares': 100},
{'Date': '7/12/2007', 'Name': 'IBM', 'Price': 91.9, 'Shares': 50},
{'Date': '7/13/2007', 'Name': 'GE', 'Price': 83.44, 'Shares': 150},
{'Date': '7/14/2007', 'Name': 'CAT', 'Price': 51.23, 'Shares': 200},
{'Date': '7/15/2007', 'Name': 'MSFT', 'Price': 40.37, 'Shares': 95},
{'Date': '7/16/2007', 'Name': 'HPE', 'Price': 65.1, 'Shares': 50},
{'Date': '7/17/2007', 'Name': 'AFL', 'Price': 70.44, 'Shares': 100},
{'Date': '7/11/2007', 'Name': 'GOOGL', 'Price': 55.2, 'Shares': 100},
{'Date': '7/12/2007', 'Name': 'IBM', 'Price': 91.9, 'Shares': 50},
{'Date': '7/13/2007', 'Name': 'GE', 'Price': 83.44, 'Shares': 150},
{'Date': '7/14/2007', 'Name': 'TCS', 'Price': 51.23, 'Shares': 200},
{'Date': '7/15/2007', 'Name': 'MSFT', 'Price': 43.37, 'Shares': 95},
{'Date': '7/16/2007', 'Name': 'INFY', 'Price': 62.1, 'Shares': 50}]
Get Holding Names
Now let us grab all unique holding names from the portfolios. We are using
set comprehension
to collect unique names.
port_names = get_portfolio_names('stocks')
print(port_names)
def get_portfolio_names(path):
portfolios = get_portfolio_as_dict(path)
names = { p['Name'] for p in portfolios}
return names
output..
{'INFY', 'HPE', 'CAT', 'AFL', 'GE', 'TCS', 'GOOGL', 'HPQ', 'IBM', 'MSFT'}
Total Value of holdings
def get_total_value(path):
portfolios = get_portfolio_as_dict(path)
total = sum([p['Shares'] * p['Price'] for p in portfolios])
return total
total_value = get_total_value('stocks')
print('Total value is {}'.format(total_value))
output..
Total value is 131824.44999999998
Current value of each holding
In this case we are using request
module to grab latest values from yahoo url.
def get_current_value(path):
names = get_portfolio_names(path)
yahoo_url = 'http://finance.yahoo.com/d/quotes.csv?s={}&f=l1'.format(names)
cur_values = requests.get(yahoo_url).text
cur_values = cur_values.split()
name_value = zip(names, cur_values)
for name,value in name_value:
print(name,value)
output..
GE 29.565
CAT 94.46
HPE 18.10
GOOGL 859.66
HPQ 18.38
MSFT 66.315
INFY 14.385
AFL 74.23
TCS 4.15
IBM 160.81
Group By
Since our data is spread across multiple files let us group by Name. We are using
itertools.groupby
in which you can specify a key to be used for grouping using
a lambda
.
def print_sorted_portfolios(path):
portfolios = get_portfolio_as_dict(path)
portfolios.sort(key=lambda x: x['Name'])
for name, items in itertools.groupby(portfolios, lambda x:x['Name']):
print(name)
for item in items:
print(' ', item)
output..
AFL
{'Shares': 100, 'Date': '7/17/2007', 'Price': 70.44, 'Name': 'AFL'}
{'Shares': 100, 'Date': '7/17/2008', 'Price': 70.44, 'Name': 'AFL'}
CAT
{'Shares': 200, 'Date': '7/14/2007', 'Price': 51.23, 'Name': 'CAT'}
{'Shares': 200, 'Date': '7/14/2008', 'Price': 51.23, 'Name': 'CAT'}
GE
{'Shares': 150, 'Date': '7/13/2007', 'Price': 83.44, 'Name': 'GE'}
{'Shares': 150, 'Date': '7/13/2008', 'Price': 83.44, 'Name': 'GE'}
{'Shares': 150, 'Date': '7/13/2009', 'Price': 83.44, 'Name': 'GE'}
GOOGL
{'Shares': 100, 'Date': '7/11/2008', 'Price': 55.2, 'Name': 'GOOGL'}
{'Shares': 100, 'Date': '7/11/2009', 'Price': 55.2, 'Name': 'GOOGL'}
HPE
{'Shares': 50, 'Date': '7/16/2007', 'Price': 65.1, 'Name': 'HPE'}
{'Shares': 50, 'Date': '7/16/2008', 'Price': 65.1, 'Name': 'HPE'}
HPQ
{'Shares': 100, 'Date': '7/11/2007', 'Price': 32.2, 'Name': 'HPQ'}
IBM
{'Shares': 50, 'Date': '7/12/2007', 'Price': 91.9, 'Name': 'IBM'}
{'Shares': 50, 'Date': '7/12/2008', 'Price': 91.9, 'Name': 'IBM'}
{'Shares': 50, 'Date': '7/12/2009', 'Price': 91.9, 'Name': 'IBM'}
INFY
{'Shares': 50, 'Date': '7/16/2009', 'Price': 62.1, 'Name': 'INFY'}
MSFT
{'Shares': 95, 'Date': '7/15/2007', 'Price': 40.37, 'Name': 'MSFT'}
{'Shares': 95, 'Date': '7/15/2008', 'Price': 40.37, 'Name': 'MSFT'}
{'Shares': 95, 'Date': '7/15/2009', 'Price': 43.37, 'Name': 'MSFT'}
TCS
{'Shares': 200, 'Date': '7/14/2009', 'Price': 51.23, 'Name': 'TCS'}
Value of each holding
Now we grouped by name, we can easily calculate total value of each holding.
def get_portfolio_values(path):
portfolios = get_portfolio_as_dict(path)
portfolios.sort(key=lambda x: x['Name'])
holding_values = {}
for name, items in itertools.groupby(portfolios, lambda x: x['Name']):
total_value = 0
for item in items:
total_value += item['Shares'] * item['Price']
holding_values[name] = total_value
return holding_values
output..
{'CAT': 20492.0, 'GOOGL': 11040.0, 'IBM': 13785.0,
'MSFT': 11790.449999999999, 'INFY': 3105.0,
'GE': 37548.0, 'HPQ': 3220.0000000000005, 'AFL': 14088.0,
'HPE': 6509.999999999999, 'TCS': 10246.0}
Top Valued
Now we got a dictionary of each portfolio and its values. Now to calculate the top one we
have to sort the dictionary. Python don’t have a straight forward way to sort a dictionary,
so we are using sorted
method in which we are passing the key to be sorted as the Price.
def get_top_portfolio(path):
portfolio_values = get_portfolio_values(path)
sorted_by_value = sorted(portfolio_values.items(), key=lambda x:x[1])
return sorted_by_value[-1]
output..
('GE', 37548.0)
Coding is fun enjoy…