A Crash Course in Python for Data Scientists

Kira Kowalska, UCL 2015

Python Philosophy

In [1]:
import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Getting started

We need two components to kick off scientific programming in Python:

  • Python
  • IPython notebook (interactive computational environment, just like RStudio for R programmers)

If you already have Python

If you already have Python installed and are familiar with installing packages, you can get IPython with pip:

In [ ]:
pip install ipython

Or if you want to also get the dependencies for the IPython notebook:

In [ ]:
pip install "ipython[notebook]"

If you are getting started with Python

If you are a new user and would like to install a full Python environment for scientific computing and data science, it is best to install the Anaconda distribution, which provides Python 2.7, IPython and all of its dependencies, as well as a complete set of open-source packages for scientific computing and data science.

  1. Download and install Continuum’s Anaconda.
  2. Update IPython to the current version using the Terminal:
In [ ]:
conda update conda
conda update ipython ipython-notebook ipython-qtconsole

Testing your installation

Once you have Python and IPython notebook installed, please test your installation by typing in the Terminal:

In [ ]:
ipython notebook

This should open a browser window showing a nearly empty page: the IPython notebook dashboard.

Python Basics

Python as a calculator

In [2]:
2+2
Out[2]:
4
In [3]:
(50-5*6)/4
Out[3]:
5
In [5]:
7/3
Out[5]:
2
In [6]:
7.0/3
Out[6]:
2.3333333333333335

Basic Data Structures

You don’t need to specify variable types. Types are determined at runtime. You can check what they are using type(var).

In [4]:
a = 10
b = 20.0
c = "It's just a text!"
d = True
In [5]:
type(a)
Out[5]:
int
In [9]:
type(b)
Out[9]:
float
In [10]:
type(c)
Out[10]:
str
In [11]:
type(d)
Out[11]:
bool

Lists

A list is Python's data structure for storing an ordered collection of objects.

In [12]:
a = [1, 2, 3, 4, 5, 6]
b = [1, "Adam", True] # can be heterogenous

List properties, list methods

In [13]:
## List properties
print max(a)
print min(a)
print sum(a)
print len(a)
6
1
21
6
In [14]:
## List methods
a.append(99) # Append an element
a.remove(2) # Remove the first occurrence of 2
a.insert(10, "HI") # Insert "HI" in position 10
a.count(99) # Count the occurrences
a.reverse()

You can explore all methods interactively using dir(a) or dir(list). You can find out what a particular method does by using help(fun).

In [15]:
help(a.append)
Help on built-in function append:

append(...)
    L.append(object) -- append object to end

List slicing, list comprehension

In [16]:
# Slicing
a = range(100)  # range command creates lists of numbers [0,1,2,3,...,98,99]  
a[2]
a[:5] # First 5 elements
a[:-5] # All but the last 5 elements (a[-5:] would give the last 5)
a[:20:2] # Every second element, up to 20.
Out[16]:
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

Note: if there are multiple commands in one cell, only the last one is printed. Alternatively, use print:

In [6]:
a = range(100)
print a[2]
print a[:5] # First 5 elements
print a[:-5] # All but the last 5 elements (a[-5:] would give the last 5)
print a[:20:2] # Every second element, up to 20.
2
[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94]
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
In [8]:
# Comprehension 
a_even = [number for number in a if number % 2 == 0]   # Pythonic style, readability counts.
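Comprehensions can also transform each element while filtering, and can be nested; a small sketch:

```python
# Transform and filter in one expression: squares of the even numbers.
squares_of_evens = [n * n for n in range(10) if n % 2 == 0]
squares_of_evens  # [0, 4, 16, 36, 64]

# A nested comprehension flattens a list of lists.
nested = [[1, 2], [3, 4], [5]]
flat = [item for sublist in nested for item in sublist]
flat  # [1, 2, 3, 4, 5]
```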

Sets

Sets are unordered collections of unique elements. Use them when you want to make sure there are no duplicates and don’t care about the order.

In [19]:
a = set()
a.add(1)
a.add(1)
print a
set([1])
In [20]:
list_with_repetitions = [1, 2, 3, 3, 3, 2, 2, 2]
print set(list_with_repetitions) # Eliminate duplicates
set([1, 2, 3])
In [10]:
languages = {'Python','R','Java'}    # you can also define sets using curly brackets
In [11]:
if 'Python' in languages:
    print "Python is a programming language!"
Python is a programming language!

Dictionaries

Dictionaries are unordered data structures that map keys to values.

In [20]:
users_age = {}
users_age["John"] = 24
users_age["Alice"] = 22
users_age["Alice"] = 20
In [22]:
print users_age["Alice"]
20
In [23]:
print users_age.keys()
['John', 'Alice']
In [24]:
print users_age.values()
[24, 20]
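Two more dictionary idioms worth knowing are iterating over key/value pairs and looking up a key with a default; a brief sketch:

```python
users_age = {"John": 24, "Alice": 20}

# .items() yields key/value pairs, convenient for iteration.
pairs = [(name, age) for name, age in users_age.items()]

# .get returns a default instead of raising KeyError for a missing key.
users_age.get("Bob", -1)   # -1

# Membership tests check the keys.
"Alice" in users_age       # True
```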

Strings

In [1]:
fruits = "Apple,Kiwi,Watermelon"
In [2]:
 fruitlist = fruits.split(",")
In [3]:
fruitlist
Out[3]:
['Apple', 'Kiwi', 'Watermelon']
In [4]:
fruits = (':').join(fruitlist)
In [5]:
fruits
Out[5]:
'Apple:Kiwi:Watermelon'
In [6]:
fruits.lower()
Out[6]:
'apple:kiwi:watermelon'
In [7]:
fruits.upper()
Out[7]:
'APPLE:KIWI:WATERMELON'
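Strings have many more methods in the same spirit; a few common ones, sketched on a made-up string:

```python
s = "  Hello, World  "

s.strip()                       # "Hello, World" -- removes surrounding whitespace
s.replace("World", "Python")    # "  Hello, Python  " -- substitution, spaces kept
s.strip().startswith("Hello")   # True
s.count("l")                    # 3 occurrences of the letter l
```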

Program Flow: if, for, while

If

In [21]:
if "Alice" in users_age:
    print "Alice's age is", users_age["Alice"]
Alice's age is 20
In [22]:
if 6<3:
    print "Oops"

For

In Python, for loops iterate over sequences.

In [34]:
people = ["Alice", "John", "Bob", "Ed"]
for person in people:    # Choose your names wisely, Python is about readability.
    print person
Alice
John
Bob
Ed
In [26]:
temp = range(5,100,5)

If you also want an index counter while iterating (as in an R-style for loop), use enumerate(list).

In [35]:
for i, person in enumerate(people):    # Choose your names wisely, Python is about readability.
    print i
    print person
0
Alice
1
John
2
Bob
3
Ed
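A related built-in is zip, which iterates over two sequences in parallel; a minimal sketch with made-up ages:

```python
people = ["Alice", "John", "Bob"]
ages = [20, 24, 31]

# zip pairs up the elements position by position.
paired = list(zip(people, ages))
paired  # [('Alice', 20), ('John', 24), ('Bob', 31)]
```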

While

Use a while loop when you are not sure in advance how many iterations you will need (for example, because they depend on user input or other events).

In [28]:
n = raw_input("Please enter 'hello':")
while n.strip() != 'hello':
    n = raw_input("Please enter 'hello':")
Please enter 'hello':gres
Please enter 'hello':egew
Please enter 'hello':hello

Functions

In [8]:
def multiply_six(n):
    return n*6
In [9]:
multiply_six(3)
Out[9]:
18
In [10]:
multiply_six("python")
Out[10]:
'pythonpythonpythonpythonpythonpython'

Functions can have optional arguments:

In [36]:
def get_volume(x,y,z=1):
    
    return x*y*z
In [33]:
get_volume(2,3)
Out[33]:
6
In [34]:
get_volume(2,4,5)
Out[34]:
40
In [35]:
dimensions = [12,3,1]
In [37]:
get_volume(*dimensions)
Out[37]:
36
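Functions can also accept a variable number of positional and keyword arguments; a sketch using a hypothetical describe function, plus argument unpacking as used with get_volume above:

```python
# Hypothetical function: *args collects extra positional arguments into a
# tuple, **kwargs collects extra keyword arguments into a dict.
def describe(*args, **kwargs):
    return len(args), sorted(kwargs)

describe(1, 2, 3, colour="red", size=10)  # (3, ['colour', 'size'])

# The * operator also works the other way: unpack a list into arguments.
def volume(x, y, z=1):
    return x * y * z

dims = [12, 3, 1]
volume(*dims)  # 36, same as volume(12, 3, 1)
```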

Classes

Python is an object-oriented programming language, so it treats everything as an object. For example, lists are objects that can be passed to built-in functions such as len(), max(), min() and that have methods such as extend() and append().

Classes give you the ability to define your own objects, just like def gives you the ability to define your own functions.

In [38]:
class Product(object):
    """A product offered by ABC Shop. Products can have the
    following properties:
        name: The name of the product.
        category: The category the product belongs to.
        rankings: List storing current rankings of the product.
        price: The current price of the product.
    """

    def __init__(self, name, category, rankings, price):
        """Create a Product object."""
        self.name = name
        self.category = category
        self.rankings = rankings
        self.price = price

    def discount(self, amount):
        """Discount the product by *amount*"""
        if amount > self.price:
            raise RuntimeError('Discount greater than current price.')
        self.price -= amount
        return self.price

    def rank(self, ranking):
        """Add new ranking to the product's list of rankings"""
        self.rankings.append(ranking)
        return self.rankings
In [39]:
table = Product(name='Table X',category='furniture',rankings=[],price=150)
In [41]:
table.price
Out[41]:
150
In [42]:
table.category
Out[42]:
'furniture'
In [43]:
table.rank(5)
Out[43]:
[5]
In [44]:
table.rankings
Out[44]:
[5]
In [45]:
table.price
Out[45]:
150
In [46]:
table.discount(40)
Out[46]:
110
In [47]:
table.price
Out[47]:
110
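Classes can also inherit from one another; a minimal sketch using a trimmed-down Product and a hypothetical SaleProduct subclass:

```python
class Product(object):
    # Trimmed-down version of the Product class, for a self-contained example.
    def __init__(self, name, price):
        self.name = name
        self.price = price

class SaleProduct(Product):
    # Hypothetical subclass: inherits name/price, adds a percentage discount.
    def __init__(self, name, price, percent_off):
        # Reuse the parent initialiser, then add the new attribute.
        super(SaleProduct, self).__init__(name, price)
        self.percent_off = percent_off

    def sale_price(self):
        return self.price * (1 - self.percent_off / 100.0)

chair = SaleProduct("Chair Y", 100, 25)
chair.sale_price()  # 75.0
```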

Modules

A module is a file containing Python definitions and statements (variables, functions, classes, ...). Python has a rich collection of modules for data analysis.

In [52]:
import math 

After importing a module, you can access its functions with a ., such as:

In [53]:
math.sqrt(5)
Out[53]:
2.23606797749979

If you don't want to write out the entire module name to access its functions, you can import it with an alias.

In [48]:
import numpy as np
In [49]:
np.sqrt(5)
Out[49]:
2.2360679774997898

You can also import selected items into the current namespace. However, this is not recommended, as existing functions with the same names would be overwritten.

In [56]:
from math import * # imports all definitions
In [57]:
sqrt(5)
Out[57]:
2.23606797749979
In [58]:
from math import sqrt # selected definitions
In [59]:
sqrt(5)
Out[59]:
2.23606797749979
In [60]:
from numpy import sqrt
In [61]:
??sqrt

Files, I/O

Text files are a very common format in Python. They can be easily manipulated using a file object's read() and write() methods.

In [50]:
f = open("foo.txt", "w")  # writes to file
f.write("Hello!\n")
f.close()

Don’t forget to close the file after writing. It is recommended to use the following form, in which the file closes automatically:

In [63]:
with open("foo.txt", "w") as f:  # writes to file
    f.write("Hello!\n")
# Closes automatically
In [64]:
with open("foo.txt", "a") as f:  # appends to existing file
    f.write("Another Hello!\n")
In [65]:
with open("foo.txt", "r") as f:  # reads from existing file
    for line in f:
        print line.rstrip()
Hello!
Another Hello!
In [66]:
with open("foo.txt", "r") as f:  
    for line in f:
        print line.rstrip().split(' ')
['Hello!']
['Another', 'Hello!']

Pickling

Text files are convenient when data needs to be exchanged with other programs. However, if your work is limited to Python only, you might prefer to store Python data structures directly as binary files. This process is referred to as pickling.

In [11]:
from cPickle import dump, load
In [12]:
l = ["a", "list", "with", "text", [13, 23, 3.14], False]
In [13]:
with open("my_list.pkl", "wb") as f:
    dump(l, f)
In [14]:
with open("my_list.pkl", "rb") as f:
    l = load(f)
l
Out[14]:
['a', 'list', 'with', 'text', [13, 23, 3.14], False]
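Pickling is not limited to files on disk; the same round-trip works with an in-memory buffer. (In Python 3, cPickle has been folded into the plain pickle module, which also exists in Python 2.)

```python
import io
import pickle

l = ["a", "list", [13, 23, 3.14], False]

buf = io.BytesIO()        # in-memory binary buffer instead of a file
pickle.dump(l, buf)       # serialise into the buffer
buf.seek(0)               # rewind before reading back
restored = pickle.load(buf)
restored == l             # True -- the structure survives the round trip
```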

Working with numerical data: NumPy/SciPy & Matplotlib

Python has two extensive libraries for dealing with numerical data (NumPy and SciPy). Together, they give Python roughly the same capabilities as R or MATLAB.

  • NumPy: extensive library for dealing with numerical data, provides n-dimensional arrays for storing and manipulating numerical data.
  • SciPy: built on top of NumPy, provides additional functionality for mathematical operations (e.g. Integration, Optimisation, Interpolation).
  • Matplotlib: flexible plotting library, provides publication quality figures.
In [15]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 
# the last line ensures that plots are displayed inline

Creating numerical arrays (i.e. matrices, vectors)

Fundamental to both Numpy and Scipy is the ability to work with vectors and matrices. You can create vectors from lists using the array command:

In [68]:
np.array([1,2,3,4,5,6])
Out[68]:
array([1, 2, 3, 4, 5, 6])
In [69]:
np.zeros(shape=(2,3),dtype='int') # arguments: shape and an optional data type
Out[69]:
array([[0, 0, 0],
       [0, 0, 0]])
In [70]:
np.ones(shape=(1,3),dtype='float')
Out[70]:
array([[ 1.,  1.,  1.]])
In [71]:
np.eye(5)
Out[71]:
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1.]])
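Two other common array constructors are arange and linspace; a quick sketch:

```python
import numpy as np

np.arange(0, 10, 2)         # array([0, 2, 4, 6, 8]) -- like range, but an array
np.linspace(0, 1, 5)        # 5 evenly spaced points: 0., 0.25, 0.5, 0.75, 1.
np.arange(6).reshape(2, 3)  # reshape the numbers 0..5 into a 2x3 matrix
```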

You can also directly import numerical data from a file into an array:

In [51]:
data = np.loadtxt("experiment.csv", dtype='float', delimiter=' ')

Basic matrix operations

In [73]:
a = np.array([[1,2],[3,4]])
b = 2*np.ones((2,1))  # Element-wise operation
In [74]:
a
Out[74]:
array([[1, 2],
       [3, 4]])
In [75]:
b
Out[75]:
array([[ 2.],
       [ 2.]])
In [76]:
a+b    # element-wise addition with broadcasting
Out[76]:
array([[ 3.,  4.],
       [ 5.,  6.]])
In [77]:
np.sum(a)
Out[77]:
10
In [78]:
np.sum(a,0)   # column sum
Out[78]:
array([4, 6])
In [79]:
np.sum(a,1)    # row sum
Out[79]:
array([3, 7])
In [80]:
np.prod(a)
Out[80]:
24
In [81]:
np.mean(a)
Out[81]:
2.5
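NumPy arrays support the same slicing as lists, extended to multiple dimensions and boolean masks; a short sketch:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)   # the numbers 0..11 as a 3x4 matrix

a[1, 2]     # 6 -- row 1, column 2
a[:, 0]     # array([0, 4, 8]) -- the first column
a[a > 8]    # array([ 9, 10, 11]) -- boolean mask selection
```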

Basic plotting

In [82]:
plt.plot([1,2,3,4],[4,2,6,1])
Out[82]:
[<matplotlib.lines.Line2D at 0x1061afc50>]
In [83]:
plt.plot([1,2,3,4],[4,2,6,1], 'o')
Out[83]:
[<matplotlib.lines.Line2D at 0x10651d4d0>]
In [84]:
plt.hist(data[:,0])
Out[84]:
(array([   3.,    7.,   23.,   61.,   91.,  108.,  107.,   65.,   28.,    7.]),
 array([ 1.46497089,  1.50822648,  1.55148207,  1.59473766,  1.63799325,
         1.68124885,  1.72450444,  1.76776003,  1.81101562,  1.85427121,
         1.8975268 ]),
 <a list of 10 Patch objects>)
In [85]:
gauss_data = """\
-0.9902286902286903,1.4065274110372852e-19
-0.7566104566104566,2.2504438576596563e-18
-0.5117810117810118,1.9459459459459454
-0.31887271887271884,10.621621621621626
-0.250997150997151,15.891891891891893
-0.1463309463309464,23.756756756756754
-0.07267267267267263,28.135135135135133
-0.04426734426734419,29.02702702702703
-0.0015939015939017698,29.675675675675677
0.04689304689304685,29.10810810810811
0.0840994840994842,27.324324324324326
0.1700546700546699,22.216216216216214
0.370878570878571,7.540540540540545
0.5338338338338338,1.621621621621618
0.722014322014322,0.08108108108108068
0.9926849926849926,-0.08108108108108646"""

# Store data as a list
data = []
for line in gauss_data.splitlines():
    words = line.split(',')
    data.append(map(float,words))
    
# Convert the list to a numpy array
data = np.array(data)

# Plot
plt.plot(data[:,0],data[:,1],'bo')
Out[85]:
[<matplotlib.lines.Line2D at 0x1068b4850>]

Curve fitting

The data in the plot above looks Gaussian. We can use the curve_fit function from SciPy, which can fit any arbitrary function to data, to fit a Gaussian to it.

First, we need to define a general Gaussian function to fit.

In [86]:
def gaus(x,a,x0,sigma):
    return a*np.exp(-(x-x0)**2/(2*sigma**2)) 

Now, fit its parameters to the data using curve_fit.

In [87]:
from scipy.optimize import curve_fit

params,conv = curve_fit(gaus,data[:,0],data[:,1])
plt.plot(data[:,0],data[:,1],'ko', label="Original Data")
plt.plot(data[:,0],gaus(data[:,0],*params),'r-', label="Fitted Curve")
plt.legend()
Out[87]:
<matplotlib.legend.Legend at 0x1077ae990>

Working with text or mixed data formats: Pandas

Python is also a good language for working with structured data of mixed types. The Pandas library gives Python SQL-like functionality for quick data querying and statistics on very large datasets.

In [54]:
import pandas as pd

Pandas dataframes

Pandas offers its own data structure for storing data, called the dataframe. It implements most common SQL commands and is somewhat similar to R dataframes. It is particularly useful for:

  • getting a first insight into the data,
  • performing data cleaning,
  • calculating basic stats,
  • exporting to a desired format or converting to a numpy array for further analysis.
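Dataframes can also be built directly from Python data structures; a small sketch with made-up values:

```python
import pandas as pd

# Each dict key becomes a column; values are the rows (made-up data).
df = pd.DataFrame({
    "party": ["labour", "ukip", "labour"],
    "lat": [51.5, 53.2, 51.6],
})

df.shape               # (3, 2) -- rows, columns
df["party"].nunique()  # 2 distinct parties
```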

Importing data into Pandas dataframes could not be easier:

In [55]:
tweets = pd.read_csv('Twitter_elections2015_with_noise.csv')
In [56]:
tweets
Out[56]:
lat lon date text party
1 52.676707 -1.824675 2015-05-05 13:54:24 My #vote is going to the #Conservative #party ... conservative
2 53.804250 -1.762647 2015-05-05 14:00:21 #miliband might break #Edstone pledges http://... labour
3 53.210070 -0.606384 2015-05-05 14:01:24 Just bumped into a man from UKIP on the electi... ukip
4 51.499266 -0.124436 2015-05-05 14:05:07 #GE2015: #LibDems would give biggest party tim... libdems
5 51.598158 -0.039050 2015-05-05 14:16:33 @danielacalota1 I feel no love for #Labour MPs... labour
6 51.598160 -0.039425 2015-05-05 14:17:32 @danielacalota1 course. But the Revolution don... labour
7 51.730727 0.474504 2015-05-05 14:42:55 #UKIP candidate for @Maldon_ is married to an... ukip
8 51.532306 -0.201850 2015-05-05 16:05:47 @pimpmytweeting @Rothwelllad001 @Boadicea51 @... ukip
9 51.499266 -0.124436 2015-05-05 17:13:39 #GE2015: #Labour manifesto 2015: the key polic... labor
10 51.556614 -0.227828 2015-05-05 17:24:17 @john_r_connolly #Conservatives #Tories #saman... conservative
11 53.790048 -1.648777 2015-05-05 17:44:21 Fucking #Labour.Stop posting campaign shit thr... labour
12 NaN NaN 2015-05-05 17:44:37 #GE2015: @David_Cameron Photobombed By #UKIP S... ukip
13 51.499266 -0.124436 2015-05-05 18:48:11 #GE2015: ##Conservative have two-point lead ov... conservative
14 51.465129 0.172980 2015-05-05 19:14:16 Well the #itv quiz still didnt help me pick be... conservative
15 53.743847 -2.711041 2015-05-05 20:20:53 Staying home from uni this week just so I can ... labour
16 52.484385 -2.147786 2015-05-05 22:12:01 This #UKIPDoArcticMonkeys is actually killing ... ukip
17 55.938064 -4.721904 2015-05-05 22:17:28 Since it wouldnt send for some reason (hmm) #U... ukip
18 51.499266 -0.124436 2015-05-05 22:28:42 #GE2015: #Labour warns of widespread teacher s... labour
19 56.129323 -3.124893 2015-05-05 22:39:33 I fucking love #UKIPDoArcticMonkeys?? ukip
20 51.448574 -2.604144 2015-05-05 23:29:15 @suethorney I think the on the ground campaign... labour
21 52.980765 -1.312040 2015-05-05 23:41:36 Save the #NHS ditch the #Conservative and elec... conservative
22 52.980819 -1.312090 2015-05-05 23:46:14 All children to be to have a professionally qu... conservative
23 53.620561 -2.213529 2015-05-05 23:53:49 #UKIPDoArcticMonkeys One for the road back to ... ukip
24 53.806855 -1.656803 2015-05-05 23:55:47 @LiarMPs @UKLabour What do you say about segre... labour
25 55.862847 -4.261423 2015-05-06 00:09:19 @BBCNews. Notice as usal nothing mentioned of ... labor
26 52.980876 -1.312153 2015-05-06 00:11:05 #Conservative pack your bags and get out (plea... conservative
27 55.838066 -4.421791 2015-05-06 00:13:40 how funny are the #UKIPDoArcticMonkeys tweets ... ukip
28 51.440717 0.248543 2015-05-06 00:59:08 @SLATUKIP Not just #UKIP But a teabagger see h... ukip
29 51.506300 -0.127100 2015-05-06 01:00:10 6. #NashsNewVideo7. #UKIPDoArcticMonkeys8. Mad... ukip
30 53.620702 -2.213447 2015-05-06 01:05:39 #UKIPDoCourteeners why are you so in love wit ukip
... ... ... ... ... ...
588 51.429550 -0.332730 2015-05-10 13:49:57 http://t.co/IgwMzoAMow. Nick Cohen #Labours pr... labour
589 52.311314 -1.708465 2015-05-10 14:16:38 #TeamChuka #Labour #LabourLeadershipElection labour
590 51.499266 -0.124436 2015-05-10 14:28:36 #GE2015: No-one is too rich to be in #Labour: ... labour
591 51.447900 -3.179960 2015-05-10 15:45:05 Wedi siomi gydar diffyg pleidleisio tactegol y... ukip
592 51.499266 -0.124436 2015-05-10 16:34:16 #GE2015: #Labour leadership: whos in the runni... labour
593 51.499266 -0.124436 2015-05-10 17:38:41 #GE2015: Garden centre owners 10% tax for #Con... conservative
594 53.411246 -2.978957 2015-05-10 17:53:43 @CllrKennedy @dan4barnsley Potentially just th... labour
595 51.499266 -0.124436 2015-05-10 19:12:10 #GE2015: Former #Labour ministers call for ban... labour
596 51.443730 -0.332640 2015-05-10 20:38:10 Surprise choice by #UKIP for new leader We had... ukip
597 51.409342 0.014571 2015-05-10 21:11:39 @GoodwinMJ @CaitlinMilazzo The existence of #U... ukip
598 51.499266 -0.124436 2015-05-10 22:52:23 #GE2015: #Conservative backbenchers threaten t... conservative
599 51.156807 -0.152688 2015-05-10 22:58:08 Excellent from @paulwaugh on @UKLabour leader ... labour
600 51.499266 -0.124436 2015-05-11 00:58:12 #GE2015: Dan Jarvis Rules Himself Out Of #Labo... labour
601 51.499266 -0.124436 2015-05-11 06:14:40 #GE2015: One-nation #Labour can still be the v... labour
602 51.499266 -0.124436 2015-05-11 07:17:14 #GE2015: Inside the campaigns: Pursuing a #Con... conservative
603 52.519695 -1.820007 2015-05-11 07:45:17 If #Labour elect a leader younger than 40 it w... labour
604 51.499266 -0.124436 2015-05-11 07:48:10 #GE2015: Inside the campaigns: cavalier #LibDe... libdems
605 53.397170 -1.499913 2015-05-11 08:21:21 Go Harriet!! #Labourleadership @BBCRadio4 labour
606 51.516775 -0.179670 2015-05-11 10:23:24 #UKIP conspiracy wonk connects #NigelFarage #T... ukip
607 51.499266 -0.124436 2015-05-11 10:28:43 #GE2015: The Big Energy Challenge for the New ... conservative
608 51.428730 -0.198030 2015-05-11 10:35:01 Disillusionment was a prized state of mind in ... labour
609 51.516761 -0.175187 2015-05-11 10:38:10 @MargotLJParker @PurpleArmy15 @BBCr4today What... ukip
610 51.516771 -0.175736 2015-05-11 10:43:24 #UKIP MEPs are raking in nearly £20 ukip
611 52.924650 -3.047840 2015-05-11 10:45:43 @CarlPackman The #Labour party are finished if... labour
612 51.516757 -0.174943 2015-05-11 10:47:13 The mission now is to expose the sham #UKIP ME... ukip
613 51.516757 -0.174941 2015-05-11 10:49:14 @quintessentia16 Its now time to expose the co... ukip
614 51.259092 -1.111522 2015-05-11 11:51:13 Before the election it was regarded as shamefu... ukip
615 51.508530 -0.125740 2015-05-11 12:02:07 Trends in #London: 1 #GE2015 2 #OnTopicTalkSho... labour
616 56.390008 -3.452144 2015-05-11 13:15:33 #Labour. Lord Sugar jumps ship saying he had b... labour
617 51.470652 0.190860 2015-05-04 14:22:12 #incompetent buffoons #the coalition #what a m... labour

617 rows × 5 columns

Quick data insights

In [90]:
tweets.head()
Out[90]:
lat lon date text party
1 52.676707 -1.824675 2015-05-05 13:54:24 My #vote is going to the #Conservative #party ... conservative
2 53.804250 -1.762647 2015-05-05 14:00:21 #miliband might break #Edstone pledges http://... labour
3 53.210070 -0.606384 2015-05-05 14:01:24 Just bumped into a man from UKIP on the electi... ukip
4 51.499266 -0.124436 2015-05-05 14:05:07 #GE2015: #LibDems would give biggest party tim... libdems
5 51.598158 -0.039050 2015-05-05 14:16:33 @danielacalota1 I feel no love for #Labour MPs... labour
In [91]:
tweets.dtypes
Out[91]:
lat      float64
lon      float64
date      object
text      object
party     object
dtype: object
In [92]:
tweets.describe()   # basic stats for numerical columns
Out[92]:
lat lon
count 616.000000 616.000000
mean 52.393075 -1.224286
std 1.266448 1.419234
min 50.241247 -6.511087
25% 51.499266 -2.146053
50% 51.621551 -1.093060
75% 53.389847 -0.124436
max 57.121479 1.732371
In [93]:
tweets['date'].value_counts()
Out[93]:
2015-05-08 12:17:06    2
2015-05-08 07:43:10    2
2015-05-08 11:03:30    1
2015-05-07 09:57:00    1
2015-05-07 12:46:58    1
2015-05-05 17:13:39    1
2015-05-07 14:19:42    1
2015-05-11 07:45:17    1
2015-05-07 22:54:00    1
2015-05-08 11:29:52    1
2015-05-06 19:40:05    1
2015-05-10 10:19:42    1
2015-05-08 07:27:16    1
2015-05-08 07:26:06    1
2015-05-08 08:27:38    1
...
2015-05-08 09:02:35    1
2015-05-09 09:25:18    1
2015-05-08 00:07:45    1
2015-05-08 11:54:06    1
2015-05-10 12:21:20    1
2015-05-08 13:19:00    1
2015-05-07 11:23:56    1
2015-05-07 11:23:57    1
2015-05-10 10:15:37    1
2015-05-07 08:04:30    1
2015-05-06 11:29:59    1
2015-05-08 02:36:46    1
2015-05-11 10:35:01    1
2015-05-07 23:06:41    1
2015-05-08 20:22:14    1
Length: 615, dtype: int64

Sorting

In [94]:
# sorting by column
tweets.sort(columns='lat').head()
Out[94]:
lat lon date text party
486 50.241247 -5.254833 2015-05-08 12:42:55 Also if the Greens and not #UKIP had 3million ... ukip
158 50.371497 -4.139490 2015-05-07 09:21:52 #IVoted #Labour @LukePollard for Plymouth Sutt... labour
543 50.389525 -4.176522 2015-05-09 10:10:31 #Labour lost 148 councillors in England BBC Ne... labour
98 50.389803 -4.130559 2015-05-06 22:35:20 Best of luck to @LukePollard tomorrow. #Labour... labour
177 50.398455 -4.418593 2015-05-07 10:09:31 If youre thinking of voting #UKIP today just r... ukip

Data Selection

In [95]:
# selection
tweets['text']
Out[95]:
1     My #vote is going to the #Conservative #party ...
2     #miliband might break #Edstone pledges http://...
3     Just bumped into a man from UKIP on the electi...
4     #GE2015: #LibDems would give biggest party tim...
5     @danielacalota1 I feel no love for #Labour MPs...
6     @danielacalota1 course. But the Revolution don...
7     #UKIP candidate for @Maldon_  is married to an...
8     @pimpmytweeting @Rothwelllad001 @Boadicea51  @...
9     #GE2015: #Labour manifesto 2015: the key polic...
10    @john_r_connolly #Conservatives #Tories #saman...
11    Fucking #Labour.Stop posting campaign shit thr...
12    #GE2015: @David_Cameron Photobombed By #UKIP S...
13    #GE2015: ##Conservative have two-point lead ov...
14    Well the #itv quiz still didnt help me pick be...
15    Staying home from uni this week just so I can ...
...
603    If #Labour elect a leader younger than 40 it w...
604    #GE2015: Inside the campaigns: cavalier #LibDe...
605            Go Harriet!! #Labourleadership @BBCRadio4
606    #UKIP conspiracy wonk connects #NigelFarage #T...
607    #GE2015: The Big Energy Challenge for the New ...
608    Disillusionment was a prized state of mind in ...
609    @MargotLJParker @PurpleArmy15 @BBCr4today What...
610                  #UKIP MEPs are raking in nearly £20
611    @CarlPackman The #Labour party are finished if...
612    The mission now is to expose the sham #UKIP ME...
613    @quintessentia16 Its now time to expose the co...
614    Before the election it was regarded as shamefu...
615    Trends in #London: 1 #GE2015 2 #OnTopicTalkSho...
616    #Labour. Lord Sugar jumps ship saying he had b...
617    #incompetent buffoons #the coalition #what a m...
Name: text, Length: 617, dtype: object
In [96]:
# boolean selection
tweets[tweets.party == 'labour'].head()
Out[96]:
lat lon date text party
2 53.804250 -1.762647 2015-05-05 14:00:21 #miliband might break #Edstone pledges http://... labour
5 51.598158 -0.039050 2015-05-05 14:16:33 @danielacalota1 I feel no love for #Labour MPs... labour
6 51.598160 -0.039425 2015-05-05 14:17:32 @danielacalota1 course. But the Revolution don... labour
11 53.790048 -1.648777 2015-05-05 17:44:21 Fucking #Labour.Stop posting campaign shit thr... labour
15 53.743847 -2.711041 2015-05-05 20:20:53 Staying home from uni this week just so I can ... labour
In [4]:
tweets.party == 'labour'
Out[4]:
1     False
2      True
3     False
4     False
5      True
6      True
7     False
8     False
9     False
10    False
11     True
12    False
13    False
14    False
15     True
...
603     True
604    False
605     True
606    False
607    False
608     True
609    False
610    False
611     True
612    False
613    False
614    False
615     True
616     True
617     True
Name: party, Length: 617, dtype: bool
In [97]:
# selection by row label
tweets.loc[2:10]
Out[97]:
lat lon date text party
2 53.804250 -1.762647 2015-05-05 14:00:21 #miliband might break #Edstone pledges http://... labour
3 53.210070 -0.606384 2015-05-05 14:01:24 Just bumped into a man from UKIP on the electi... ukip
4 51.499266 -0.124436 2015-05-05 14:05:07 #GE2015: #LibDems would give biggest party tim... libdems
5 51.598158 -0.039050 2015-05-05 14:16:33 @danielacalota1 I feel no love for #Labour MPs... labour
6 51.598160 -0.039425 2015-05-05 14:17:32 @danielacalota1 course. But the Revolution don... labour
7 51.730727 0.474504 2015-05-05 14:42:55 #UKIP candidate for @Maldon_ is married to an... ukip
8 51.532306 -0.201850 2015-05-05 16:05:47 @pimpmytweeting @Rothwelllad001 @Boadicea51 @... ukip
9 51.499266 -0.124436 2015-05-05 17:13:39 #GE2015: #Labour manifesto 2015: the key polic... labor
10 51.556614 -0.227828 2015-05-05 17:24:17 @john_r_connolly #Conservatives #Tories #saman... conservative

Data Cleaning

  • Setting data types
In [98]:
tweets.dtypes
Out[98]:
lat      float64
lon      float64
date      object
text      object
party     object
dtype: object
In [99]:
tweets['date'] = pd.to_datetime(tweets['date'])
  • Overwriting existing values
In [57]:
tweets[tweets.party == 'labor']
Out[57]:
lat lon date text party
9 51.499266 -0.124436 2015-05-05 17:13:39 #GE2015: #Labour manifesto 2015: the key polic... labor
25 55.862847 -4.261423 2015-05-06 00:09:19 @BBCNews. Notice as usal nothing mentioned of ... labor
46 51.519845 -0.112177 2015-05-06 09:30:02 Heres @DaveHill arguing why London needs a #La... labor
88 51.991713 -1.077186 2015-05-06 21:09:31 More NHS sold off under @UKLabour using yr ter... labor
In [58]:
tweets.loc[tweets.party == 'labor','party'] = 'labour'
In [59]:
tweets[tweets.party == 'labor']
Out[59]:
lat lon date text party
  • Dealing with missing data
In [60]:
tweets[tweets['lat'].apply(np.isnan)]
Out[60]:
lat lon date text party
12 NaN NaN 2015-05-05 17:44:37 #GE2015: @David_Cameron Photobombed By #UKIP S... ukip
In [61]:
tweets2 = tweets.dropna(how='any').head()    # drops any rows that have missing values
In [62]:
tweets2[tweets2['lat'].apply(np.isnan)]
Out[62]:
lat lon date text party
In [264]:
tweets3 = tweets.fillna(value=5).head()    # fills NaN values with 5
In [265]:
tweets3[tweets3['lat'].apply(np.isnan)]
Out[265]:
lat lon date text party

String Methods

In [103]:
tweets.text.str.lower()    # make text lowercase
Out[103]:
1     my #vote is going to the #conservative #party ...
2     #miliband might break #edstone pledges http://...
3     just bumped into a man from ukip on the electi...
4     #ge2015: #libdems would give biggest party tim...
5     @danielacalota1 i feel no love for #labour mps...
6     @danielacalota1 course. but the revolution don...
7     #ukip candidate for @maldon_  is married to an...
8     @pimpmytweeting @rothwelllad001 @boadicea51  @...
9     #ge2015: #labour manifesto 2015: the key polic...
10    @john_r_connolly #conservatives #tories #saman...
11    fucking #labour.stop posting campaign shit thr...
12    #ge2015: @david_cameron photobombed by #ukip s...
13    #ge2015: ##conservative have two-point lead ov...
14    well the #itv quiz still didnt help me pick be...
15    staying home from uni this week just so i can ...
...
603    if #labour elect a leader younger than 40 it w...
604    #ge2015: inside the campaigns: cavalier #libde...
605            go harriet!! #labourleadership @bbcradio4
606    #ukip conspiracy wonk connects #nigelfarage #t...
607    #ge2015: the big energy challenge for the new ...
608    disillusionment was a prized state of mind in ...
609    @margotljparker @purplearmy15 @bbcr4today what...
610                  #ukip meps are raking in nearly £20
611    @carlpackman the #labour party are finished if...
612    the mission now is to expose the sham #ukip me...
613    @quintessentia16 its now time to expose the co...
614    before the election it was regarded as shamefu...
615    trends in #london: 1 #ge2015 2 #ontopictalksho...
616    #labour. lord sugar jumps ship saying he had b...
617    #incompetent buffoons #the coalition #what a m...
Name: text, Length: 617, dtype: object

Grouping

In [104]:
tweets.groupby('party').size()
Out[104]:
party
conservative    164
labor             4
labour          299
libdems          32
ukip            118
dtype: int64

SQL-like Joins

It is also possible to join multiple dataframes on common keys...
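As a minimal sketch of such a join, pd.merge performs an SQL-style inner join on a shared column (the dataframes below are made up for illustration):

```python
import pandas as pd

# Two small dataframes sharing a 'party' column.
tweets_df = pd.DataFrame({"party": ["labour", "ukip", "labour"],
                          "text": ["a", "b", "c"]})
leaders = pd.DataFrame({"party": ["labour", "ukip"],
                        "leader": ["Miliband", "Farage"]})

# Inner join on the common key, as in SQL.
joined = pd.merge(tweets_df, leaders, on="party")
joined.shape  # (3, 3) -- each tweet row gains its party's leader
```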

Machine Learning: Scikit

Two (three) types of machine learning

  • Supervised
  • Unsupervised
  • Reinforcement

Supervised learning

Given: data + labels

Goal: predict labels given data

Examples:

  • Classification (spam, sentiment analysis, ...)
  • Regression (stocks, sales, ...)
  • Ranking (retrieval, search, ...)

Unsupervised learning

Given: data

Goal: learn the internal structure of the data

Examples:

  • Dimensionality reduction
  • Clustering
  • Manifold learning

Scikit-learn

  • Comprehensive collection of machine learning algorithms and tools in Python.
  • Used both in academia and industry (Spotify, bit.ly, Evernote).
  • ~20 core developers.
  • Clean code and very good documentation.
  • Note: only suitable for numerical data, so categorical features must be encoded first!
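Since the models expect numbers, string-valued features need encoding before fitting. A minimal sketch using scikit-learn's LabelEncoder (the party names here are just sample values):

```python
from sklearn.preprocessing import LabelEncoder

parties = ['labour', 'ukip', 'conservative', 'labour']
le = LabelEncoder()
codes = le.fit_transform(parties)  # maps each string to an integer code

print(codes)        # one integer code per party
print(le.classes_)  # the sorted original categories
```

For nominal features with no natural ordering, a one-hot encoding (e.g. pandas' get_dummies) is usually preferable to raw integer codes.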

Basic usage

  • Everything is a numpy array (or a scipy sparse matrix)!
  • Store the data in matrix X and the labels in vector y.
  • X.shape is always (n_samples, n_features).
  • The only commands you need to run any algorithm are model.fit(X, y) (or model.fit(X) for unsupervised learning), then model.predict(X) / model.transform(X)!

Example

In [16]:
from sklearn import datasets
iris = datasets.load_iris()
In [67]:
iris.keys()
Out[67]:
['target_names', 'data', 'target', 'DESCR', 'feature_names']
In [68]:
iris.values()
Out[68]:
[array(['setosa', 'versicolor', 'virginica'], 
       dtype='|S10'), array([[ 5.1,  3.5,  1.4,  0.2],
        [ 4.9,  3. ,  1.4,  0.2],
        [ 4.7,  3.2,  1.3,  0.2],
        [ 4.6,  3.1,  1.5,  0.2],
        [ 5. ,  3.6,  1.4,  0.2],
        [ 5.4,  3.9,  1.7,  0.4],
        [ 4.6,  3.4,  1.4,  0.3],
        [ 5. ,  3.4,  1.5,  0.2],
        [ 4.4,  2.9,  1.4,  0.2],
        [ 4.9,  3.1,  1.5,  0.1],
        [ 5.4,  3.7,  1.5,  0.2],
        [ 4.8,  3.4,  1.6,  0.2],
        [ 4.8,  3. ,  1.4,  0.1],
        [ 4.3,  3. ,  1.1,  0.1],
        [ 5.8,  4. ,  1.2,  0.2],
        [ 5.7,  4.4,  1.5,  0.4],
        [ 5.4,  3.9,  1.3,  0.4],
        [ 5.1,  3.5,  1.4,  0.3],
        [ 5.7,  3.8,  1.7,  0.3],
        [ 5.1,  3.8,  1.5,  0.3],
        [ 5.4,  3.4,  1.7,  0.2],
        [ 5.1,  3.7,  1.5,  0.4],
        [ 4.6,  3.6,  1. ,  0.2],
        [ 5.1,  3.3,  1.7,  0.5],
        [ 4.8,  3.4,  1.9,  0.2],
        [ 5. ,  3. ,  1.6,  0.2],
        [ 5. ,  3.4,  1.6,  0.4],
        [ 5.2,  3.5,  1.5,  0.2],
        [ 5.2,  3.4,  1.4,  0.2],
        [ 4.7,  3.2,  1.6,  0.2],
        [ 4.8,  3.1,  1.6,  0.2],
        [ 5.4,  3.4,  1.5,  0.4],
        [ 5.2,  4.1,  1.5,  0.1],
        [ 5.5,  4.2,  1.4,  0.2],
        [ 4.9,  3.1,  1.5,  0.1],
        [ 5. ,  3.2,  1.2,  0.2],
        [ 5.5,  3.5,  1.3,  0.2],
        [ 4.9,  3.1,  1.5,  0.1],
        [ 4.4,  3. ,  1.3,  0.2],
        [ 5.1,  3.4,  1.5,  0.2],
        [ 5. ,  3.5,  1.3,  0.3],
        [ 4.5,  2.3,  1.3,  0.3],
        [ 4.4,  3.2,  1.3,  0.2],
        [ 5. ,  3.5,  1.6,  0.6],
        [ 5.1,  3.8,  1.9,  0.4],
        [ 4.8,  3. ,  1.4,  0.3],
        [ 5.1,  3.8,  1.6,  0.2],
        [ 4.6,  3.2,  1.4,  0.2],
        [ 5.3,  3.7,  1.5,  0.2],
        [ 5. ,  3.3,  1.4,  0.2],
        [ 7. ,  3.2,  4.7,  1.4],
        [ 6.4,  3.2,  4.5,  1.5],
        [ 6.9,  3.1,  4.9,  1.5],
        [ 5.5,  2.3,  4. ,  1.3],
        [ 6.5,  2.8,  4.6,  1.5],
        [ 5.7,  2.8,  4.5,  1.3],
        [ 6.3,  3.3,  4.7,  1.6],
        [ 4.9,  2.4,  3.3,  1. ],
        [ 6.6,  2.9,  4.6,  1.3],
        [ 5.2,  2.7,  3.9,  1.4],
        [ 5. ,  2. ,  3.5,  1. ],
        [ 5.9,  3. ,  4.2,  1.5],
        [ 6. ,  2.2,  4. ,  1. ],
        [ 6.1,  2.9,  4.7,  1.4],
        [ 5.6,  2.9,  3.6,  1.3],
        [ 6.7,  3.1,  4.4,  1.4],
        [ 5.6,  3. ,  4.5,  1.5],
        [ 5.8,  2.7,  4.1,  1. ],
        [ 6.2,  2.2,  4.5,  1.5],
        [ 5.6,  2.5,  3.9,  1.1],
        [ 5.9,  3.2,  4.8,  1.8],
        [ 6.1,  2.8,  4. ,  1.3],
        [ 6.3,  2.5,  4.9,  1.5],
        [ 6.1,  2.8,  4.7,  1.2],
        [ 6.4,  2.9,  4.3,  1.3],
        [ 6.6,  3. ,  4.4,  1.4],
        [ 6.8,  2.8,  4.8,  1.4],
        [ 6.7,  3. ,  5. ,  1.7],
        [ 6. ,  2.9,  4.5,  1.5],
        [ 5.7,  2.6,  3.5,  1. ],
        [ 5.5,  2.4,  3.8,  1.1],
        [ 5.5,  2.4,  3.7,  1. ],
        [ 5.8,  2.7,  3.9,  1.2],
        [ 6. ,  2.7,  5.1,  1.6],
        [ 5.4,  3. ,  4.5,  1.5],
        [ 6. ,  3.4,  4.5,  1.6],
        [ 6.7,  3.1,  4.7,  1.5],
        [ 6.3,  2.3,  4.4,  1.3],
        [ 5.6,  3. ,  4.1,  1.3],
        [ 5.5,  2.5,  4. ,  1.3],
        [ 5.5,  2.6,  4.4,  1.2],
        [ 6.1,  3. ,  4.6,  1.4],
        [ 5.8,  2.6,  4. ,  1.2],
        [ 5. ,  2.3,  3.3,  1. ],
        [ 5.6,  2.7,  4.2,  1.3],
        [ 5.7,  3. ,  4.2,  1.2],
        [ 5.7,  2.9,  4.2,  1.3],
        [ 6.2,  2.9,  4.3,  1.3],
        [ 5.1,  2.5,  3. ,  1.1],
        [ 5.7,  2.8,  4.1,  1.3],
        [ 6.3,  3.3,  6. ,  2.5],
        [ 5.8,  2.7,  5.1,  1.9],
        [ 7.1,  3. ,  5.9,  2.1],
        [ 6.3,  2.9,  5.6,  1.8],
        [ 6.5,  3. ,  5.8,  2.2],
        [ 7.6,  3. ,  6.6,  2.1],
        [ 4.9,  2.5,  4.5,  1.7],
        [ 7.3,  2.9,  6.3,  1.8],
        [ 6.7,  2.5,  5.8,  1.8],
        [ 7.2,  3.6,  6.1,  2.5],
        [ 6.5,  3.2,  5.1,  2. ],
        [ 6.4,  2.7,  5.3,  1.9],
        [ 6.8,  3. ,  5.5,  2.1],
        [ 5.7,  2.5,  5. ,  2. ],
        [ 5.8,  2.8,  5.1,  2.4],
        [ 6.4,  3.2,  5.3,  2.3],
        [ 6.5,  3. ,  5.5,  1.8],
        [ 7.7,  3.8,  6.7,  2.2],
        [ 7.7,  2.6,  6.9,  2.3],
        [ 6. ,  2.2,  5. ,  1.5],
        [ 6.9,  3.2,  5.7,  2.3],
        [ 5.6,  2.8,  4.9,  2. ],
        [ 7.7,  2.8,  6.7,  2. ],
        [ 6.3,  2.7,  4.9,  1.8],
        [ 6.7,  3.3,  5.7,  2.1],
        [ 7.2,  3.2,  6. ,  1.8],
        [ 6.2,  2.8,  4.8,  1.8],
        [ 6.1,  3. ,  4.9,  1.8],
        [ 6.4,  2.8,  5.6,  2.1],
        [ 7.2,  3. ,  5.8,  1.6],
        [ 7.4,  2.8,  6.1,  1.9],
        [ 7.9,  3.8,  6.4,  2. ],
        [ 6.4,  2.8,  5.6,  2.2],
        [ 6.3,  2.8,  5.1,  1.5],
        [ 6.1,  2.6,  5.6,  1.4],
        [ 7.7,  3. ,  6.1,  2.3],
        [ 6.3,  3.4,  5.6,  2.4],
        [ 6.4,  3.1,  5.5,  1.8],
        [ 6. ,  3. ,  4.8,  1.8],
        [ 6.9,  3.1,  5.4,  2.1],
        [ 6.7,  3.1,  5.6,  2.4],
        [ 6.9,  3.1,  5.1,  2.3],
        [ 5.8,  2.7,  5.1,  1.9],
        [ 6.8,  3.2,  5.9,  2.3],
        [ 6.7,  3.3,  5.7,  2.5],
        [ 6.7,  3. ,  5.2,  2.3],
        [ 6.3,  2.5,  5. ,  1.9],
        [ 6.5,  3. ,  5.2,  2. ],
        [ 6.2,  3.4,  5.4,  2.3],
        [ 5.9,  3. ,  5.1,  1.8]]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), 'Iris Plants Database\n\nNotes\n-----\nData Set Characteristics:\n    :Number of Instances: 150 (50 in each of three classes)\n    :Number of Attributes: 4 numeric, predictive attributes and the class\n    :Attribute Information:\n        - sepal length in cm\n        - sepal width in cm\n        - petal length in cm\n        - petal width in cm\n        - class:\n                - Iris-Setosa\n                - Iris-Versicolour\n                - Iris-Virginica\n    :Summary Statistics:\n    ============== ==== ==== ======= ===== ====================\n                    Min  Max   Mean    SD   Class Correlation\n    ============== ==== ==== ======= ===== ====================\n    sepal length:   4.3  7.9   5.84   0.83    0.7826\n    sepal width:    2.0  4.4   3.05   0.43   -0.4194\n    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)\n    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)\n    ============== ==== ==== ======= ===== ====================\n    :Missing Attribute Values: None\n    :Class Distribution: 33.3% for each of 3 classes.\n    :Creator: R.A. Fisher\n    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n    :Date: July, 1988\n\nThis is a copy of UCI ML iris datasets.\nhttp://archive.ics.uci.edu/ml/datasets/Iris\n\nThe famous Iris database, first used by Sir R.A Fisher\n\nThis is perhaps the best known database to be found in the\npattern recognition literature.  Fisher\'s paper is a classic in the field and\nis referenced frequently to this day.  (See Duda & Hart, for example.)  The\ndata set contains 3 classes of 50 instances each, where each class refers to a\ntype of iris plant.  One class is linearly separable from the other 2; the\nlatter are NOT linearly separable from each other.\n\nReferences\n----------\n   - Fisher,R.A. 
"The use of multiple measurements in taxonomic problems"\n     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to\n     Mathematical Statistics" (John Wiley, NY, 1950).\n   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.\n     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.\n   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System\n     Structure and Classification Rule for Recognition in Partially Exposed\n     Environments".  IEEE Transactions on Pattern Analysis and Machine\n     Intelligence, Vol. PAMI-2, No. 1, 67-71.\n   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions\n     on Information Theory, May 1972, 431-433.\n   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II\n     conceptual clustering system finds 3 classes in the data.\n   - Many, many more ...\n', ['sepal length (cm)',
  'sepal width (cm)',
  'petal length (cm)',
  'petal width (cm)']]
In [64]:
iris.keys()
Out[64]:
['target_names', 'data', 'target', 'DESCR', 'feature_names']
In [65]:
iris['feature_names']
Out[65]:
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']
In [17]:
X = iris['data']    # data
y = iris['target']  # labels
In [71]:
X.shape  # 150 samples with 4 features
Out[71]:
(150, 4)
In [72]:
X[0,:]
Out[72]:
array([ 5.1,  3.5,  1.4,  0.2])

Supervised learning: classification

In [73]:
from sklearn import svm
modelsvm = svm.SVC()
modelsvm.fit(X, y)  
Out[73]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [ ]:
??modelsvm

Classify any example of choice:

In [111]:
modelsvm.predict([[5,0.5,10,1]])
Out[111]:
array([2])

Alternatively, split the data into training and testing:

In [74]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
In [75]:
X_train.shape, y_train.shape
Out[75]:
((100, 4), (100,))
In [76]:
X_test.shape, y_test.shape
Out[76]:
((50, 4), (50,))
In [77]:
modelsvm2 = svm.SVC()
modelsvm2.fit(X_train, y_train)  
Out[77]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [116]:
modelsvm2.predict(X_test)
Out[116]:
array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2, 0,
       2, 2, 2, 2, 2, 0, 0, 0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 2, 1, 1, 0, 0, 1,
       2, 2, 1, 2])
In [78]:
modelsvm2.score(X_test, y_test)   
Out[78]:
1.0
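A perfect score on a single split can be luck of the draw; cross-validation averages the score over several splits. A sketch using cross_val_score (imported here from sklearn.model_selection; in the sklearn version used above it lives in sklearn.cross_validation):

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older releases

iris = datasets.load_iris()
X, y = iris['data'], iris['target']

# 5-fold cross-validation: train on 4/5 of the data, score on the held-out fifth
scores = cross_val_score(svm.SVC(), X, y, cv=5)
print(scores.mean())
```

The mean of the five fold scores is a more robust estimate of generalisation performance than any single train/test split.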

Unsupervised learning: clustering

In [18]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2)
In [19]:
kmeans.fit(X)
Out[19]:
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=10,
    n_jobs=1, precompute_distances=True, random_state=None, tol=0.0001,
    verbose=0)
In [81]:
??kmeans
In [20]:
kmeans.labels_
Out[20]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
In [21]:
plt.scatter(X[:,0],X[:,1],c=kmeans.labels_.astype(np.float),edgecolors='none')
plt.xlabel(iris['feature_names'][0])
plt.ylabel(iris['feature_names'][1])
Out[21]:
<matplotlib.text.Text at 0x107f95a10>
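Because the true iris labels are known, we can check how well the (unsupervised) clusters line up with them. A sketch using the adjusted Rand index, which is near 0 for random agreement and 1 for a perfect match; three clusters are used this time to match the three species:

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

iris = datasets.load_iris()
X, y = iris['data'], iris['target']

# Three clusters, matching the three iris species
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Compare cluster assignments to the true labels (label permutations don't matter)
print(adjusted_rand_score(y, kmeans.labels_))
```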

Algorithm cheat-sheet

Final things to remember:

  • indexing starts from zero,
  • some objects in Python are mutable (e.g. lists); check object identity using id(),
  • indentation instead of brackets.
In [85]:
a = [1,3,5]
In [86]:
id(a)
Out[86]:
4409868512
In [87]:
b = a
In [88]:
b.append(6)
In [89]:
b
Out[89]:
[1, 3, 5, 6]
In [90]:
a
Out[90]:
[1, 3, 5, 6]
In [91]:
id(b)
Out[91]:
4409868512
In [92]:
c = list(a)
In [93]:
id(c)
Out[93]:
4443682432
In [94]:
c
Out[94]:
[1, 3, 5, 6]
In [95]:
??list
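The id() experiments above can be summed up: plain assignment creates an alias, list() makes a shallow copy, and copy.deepcopy also copies nested objects. A quick sketch:

```python
import copy

a = [1, 3, [5]]
b = a                  # alias: same object, same id
c = list(a)            # shallow copy: new outer list, shared inner list
d = copy.deepcopy(a)   # deep copy: nothing shared

b.append(6)            # visible through a (same object)
a[2].append(7)         # visible through c (shared inner list), not through d

print(a)  # [1, 3, [5, 7], 6]
print(c)  # [1, 3, [5, 7]]
print(d)  # [1, 3, [5]]
```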