Pragmatic

Polyglot Data Analysis

 
Laurent Gautier - ODSC - Boston Data Festival 2016
(notebook as HTML | ipynb)
Disclaimer:

I am here as an individual, long-active Open Source developer and data scientist. Views and opinions expressed are solely my own.

Slides

http://lgautier.github.io/rpy2/odsc-bdf-2016-slides

Acknowledgments

  • Organizers and sponsors of this meeting
  • Willem G. van Panhuis, John Grefenstette, Su Yon Jung, Nian Shong Chok, Anne Cross, Heather Eng, Bruce Y Lee, Vladimir Zadorozhny, Shawn Brown, Derek Cummings, Donald S. Burke. Contagious Diseases in the United States from 1888 to the present. NEJM 2013; 369(22): 2152-2158.
  • authors, contributors, and sponsors for
    • R
    • Python
    • R packages
    • Python packages
    • Docker
  • rpy2 contributors
  • rpy2 users (in the hope they become contributors)

This talk contains runnable code

The easy way:

Install Docker:
https://www.docker.com
Run:

docker run --rm -it -p 8888:8888 \
           lgautier/pragmatic-polyglot-data-analysis

Visit http://localhost:8888/

First, two questions. Who thinks that:

- mixing languages is a good idea?

- mixing languages is NOT a good idea?

Is this familiar?


cursor = dbcon.cursor()

sql = """
SELECT state, count(city)
FROM location
WHERE state LIKE 'M%'
GROUP BY state
"""

cursor.execute(sql)

Hardly groundbreaking.

That's polyglot data analysis though.

You* probably have already done polyglot data analysis

(*: yes, you)

You should not feel bad about it

Data in - Insights out

bilingual

Data science: a language landscape

(alphabetical order - not comprehensive)

C C++ bash Javascript Julia Python R Scala SQL

Julia → Python: PyCall
Julia → R: Rif or RCall
[somewhere] → R: Rserve
R → Python: rPython

Polyglot data analysis: a project/ensemble with:

  • A master/glue/host language
  • One or more slave/guest languages

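For illustration, a minimal host/guest sketch in the spirit of the opening example: Python as the host, SQL as the guest. The in-memory database, table, and rows below are made up for the sketch.

```python
import sqlite3

# Python is the host: it orchestrates, holds state, and consumes results.
# SQL is the guest: embedded as a string, doing what it is best at.
dbcon = sqlite3.connect(":memory:")
cursor = dbcon.cursor()
cursor.execute("CREATE TABLE location (state TEXT, city TEXT)")
cursor.executemany("INSERT INTO location VALUES (?, ?)",
                   [("MA", "Boston"), ("MA", "Worcester"), ("NY", "Albany")])

# The guest language performs the set operation; the host prints the result.
cursor.execute("SELECT state, count(city) FROM location "
               "GROUP BY state ORDER BY state")
print(cursor.fetchall())  # [('MA', 2), ('NY', 1)]
```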
Data in tables


Detailed account of raw materials and workdays for a basketry workshop. Clay tablet, ca. 2040 BC (Ur III).
Source: Wikipedia, Marie-Lan Nguyen (2006)

Project Tycho®

screenshot http://www.tycho.pitt.edu/

In[1]:
! git clone https://github.com/lgautier/project-tycho-utilities.git
fatal: destination path 'project-tycho-utilities' already exists and is not an empty directory.
In[2]:
! cd project-tycho-utilities/ && DBNAME=../tycho.db make all
make[1]: Entering directory '/opt/local/render/project-tycho-utilities'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/opt/local/render/project-tycho-utilities'
location
  id     INTEGER
  state  VARCHAR
  city   VARCHAR

casecount
  id           INTEGER
  location_id  INTEGER
  disease_id   INTEGER
  count        INTEGER

disease
  id    INTEGER
  name  VARCHAR

Work with a relational database

In[3]:
import sqlite3
dbfilename = "tycho.db"
dbcon = sqlite3.connect(dbfilename)
cursor = dbcon.cursor()
In[4]:
sql = """
SELECT state, city
FROM location
WHERE state LIKE 'M%'
"""
cursor.execute(sql)
Out[4]:
<sqlite3.Cursor at 0x7f65dc575340>

Python and SQL

SQL-in-Python
In[5]:
from collections import Counter
ct = Counter()
for row_n, (state, city) in enumerate(cursor, 1):
    ct[state] += 1
print(ct.most_common(n=5))
[('MA', 67), ('MI', 26), ('MN', 10), ('ME', 9), ('MO', 8)]

Slightly more complicated:

For each state, count the number of cities for which we have case counts for more than 5 different diseases. Oh, and sort the list of states in decreasing count order. In fact, only report the first 5 states.
In[7]:
sql = """
SELECT state, count(city) city_count
FROM (select D.location_id
      FROM (select location_id, COUNT(DISTINCT(disease_id)) AS disease_count
            FROM casecount
            GROUP BY location_id) AS D
      WHERE D.disease_count > 5) AS HD
INNER JOIN location
ON HD.location_id = location.id
GROUP BY state
ORDER BY city_count DESC
LIMIT 5
"""
cursor.execute(sql)
Out[7]:
<sqlite3.Cursor at 0x7f65dc575490>

Complex queries require increasingly complex SQL

⇨ So we need to know SQL, don't we?

Object-relational mapping (ORM)

The promise: operate on objects in the host language, without writing SQL.

Python and ORMs


cursor = dbcon.cursor()
sql = """
SELECT *
FROM location
WHERE state LIKE 'M%'
"""
cursor.execute(sql)
result = cursor.fetchall()
		    


result = (Location
          .select()
          .where(Location.state.startswith('M')))
	      
(example with peewee)

Set up

In[8]:
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session
from sqlalchemy import create_engine

Base = automap_base()

# engine, suppose it has two tables 'user' and 'address' set up
engine = create_engine("sqlite:///tycho.db")

Create classes through reflection

In[9]:
Base.prepare(engine, reflect=True)
location = Base.classes.location

Query as chained calls

In[10]:
session = Session(engine)
from sqlalchemy import func # SQL functions

query = (session
         .query(location.state,
                func.count(location.city))
         .filter(location.state.like('M%'))
         .group_by(location.state)
         .order_by(func.count(location.city).desc())
         .limit(5))
res = query.all()

Chained calls generate SQL

In[11]:
from sqlalchemy.dialects import sqlite
print(str(query.statement.compile(dialect=session.bind.dialect))
      .replace('GROUP BY', '\nGROUP BY'))
SELECT location.state, count(location.city) AS count_1 
FROM location 
WHERE location.state LIKE ? 
GROUP BY location.state ORDER BY count(location.city) DESC
 LIMIT ? OFFSET ?
We traded knowledge of SQL for the knowledge of an API.

Back to tables

  • Model tables
  • Make most-common operations easy

dplyr

mutate     modify/add column
filter     filter rows
select     select columns
group_by   group rows
summarize  summarize groups of rows

(RStudio)

R code:


datasrc <- src_sqlite(dbfilename)
location_tbl <- tbl(datasrc, "location")


res <- filter(location_tbl,
              state %like% 'M%') %>%
       count(city) %>%
       arrange(desc(n))

But... we started with Python, didn't we?


Python vs R, anyone?

Python → R!

R-in-Python

Just import the R package from Python

In[13]:
from rpy2.robjects.lib import dplyr


datasrc  = dplyr.src_sqlite(dbfilename)
location_tbl = datasrc.get_table("location")

Query as chained calls

In[14]:
res =  (location_tbl
        .filter('state %like% "M%"')
        .group_by('state')
        .count('state')
        .arrange('desc(n)'))

print(res)
Source:   query [?? x 2]
Database: sqlite 3.8.6 [tycho.db]

  state     n
  <chr> <int>
1    MA    67
2    MI    26
3    MN    10
4    ME     9
5    MO     8
6    MT     7
7    MD     5
8    MS     3
9    MP     1

Back to our complex SQL query

In[15]:
casecount_tbl = datasrc.get_table("casecount")

##
disease_per_city = (casecount_tbl
                    .group_by('location_id')
                    .summarize(n='count(distinct(disease_id))'))
##
high_disease = (disease_per_city
                .filter('n > 5'))
##
inner_join = dplyr.dplyr.inner_join
join_location = inner_join((location_tbl
                            .mutate(location_id="id")),
                           high_disease,
                           by="location_id")
res = (dplyr.DataFrame(join_location)
       .group_by('state')
       .summarize(n='count(city)')
       .arrange('desc(n)')
       .collect())

While we are with R, what about figures?

(and R as a DSL in Python?)
In[16]:
from rpy2.robjects import r, globalenv
import rpy2.robjects.lib.ggplot2 as gg

p = (gg.ggplot(res.head(20)) +
     gg.geom_bar(gg.aes_string(x='factor(state, levels=as.character(state))', 
                               y='n'),
                 stat='identity') +
     gg.scale_x_discrete("State") +
     gg.scale_y_sqrt("# cities w/ >5 diseases"))
In [17]:
from rpy2.ipython.ggplot import image_png
from IPython.display import display

display(image_png(p))
In [18]:
from rpy2.robjects import baseenv
state_abb = (dplyr.DataFrame({'state': baseenv.get('state.abb'),
                             'region': baseenv.get('state.name')})
             .mutate(region = 'tolower(region)'))
from rpy2.robjects.packages import importr
maps = importr('maps')
states = gg.map_data('state')

merge = baseenv.get('merge')
states_map = merge(states, state_abb, sort=False, by="region")
dataf_plot = merge(states_map, res, sort=False, by="state")
dataf_plot = dplyr.DataFrame(dataf_plot).arrange('order')

p = (gg.ggplot(dataf_plot) +
     gg.geom_polygon(gg.aes_string(x='long', y='lat', group='group', fill='n')) +
     gg.scale_fill_continuous(trans="sqrt") +
     gg.coord_map("albers", lat0=45.5, lat1=29.5))
display(image_png(p))
In[19]:
sql_disease = """
SELECT date_from, count, city
FROM casecount
INNER JOIN disease
ON casecount.disease_id=disease.id
INNER JOIN location
ON casecount.location_id=location.id
WHERE disease.name='%s'
AND state='%s'
AND city IS NOT NULL
"""
sql = sql_disease % ('PNEUMONIA', 'MA')
In[20]:
dataf = dplyr.DataFrame(dplyr.tbl(datasrc, dplyr.dplyr.sql(sql))).collect()
dataf = dataf.mutate(date='as.POSIXct(strptime(date_from, format="%Y-%m-%d"))')
dataf = dataf.mutate(month = 'format(date, "%m")',
                     year = 'format(date, "%Y")')
# sum per month
dataf_plot = (dataf
              .group_by('city', 'month','year')
              .summarize(count='sum(count)'))
# 
yearmonth_to_date = """
as.POSIXct(
    strptime(
        paste(year, month, "15", sep="-"),
        format="%Y-%m-%d")
    )
"""

dataf_plot = dataf_plot.mutate(date=yearmonth_to_date)
In [26]:
p = (gg.ggplot(dataf_plot
               .filter('year > 1925')) +
     gg.geom_line(gg.aes_string(x='month', y='count',
     group='year', color='city')) +
     gg.facet_grid('city~.', scales="free_y") +
     gg.scale_y_sqrt())
display(image_png(p))

What about Javascript then?

Widgets
In[29]:
def foo(disease, state):
    sql = sql_disease % (disease, state)
    dataf = dplyr.DataFrame(dplyr.tbl(datasrc, dplyr.dplyr.sql(sql))).collect()
    dataf = dataf.mutate(date='as.POSIXct(strptime(date_from, format="%Y-%m-%d"))')
    dataf = dataf.mutate(month = 'format(date, "%m")',
                         year = 'format(date, "%Y")')

    dataf_plot = (dataf
                  .group_by('city', 'month','year')
                  .summarize(count='sum(count)'))
    
    dataf_plot = dataf_plot.mutate(date=yearmonth_to_date)
    p = (gg.ggplot(dataf_plot
               .filter('year > 1925')) +
         gg.geom_line(gg.aes_string(x='month', y='count+1',
	                            group='year', color='city')) +
         gg.facet_grid('city~.', scales="free_y") +
         gg.scale_y_sqrt() +
         gg.ggtitle(disease))
    display(image_png(p, height=600))
In[30]:
from ipywidgets import interact
interact(foo,
         disease=('PNEUMONIA','INFLUENZA','MEASLES'),
         state=('MA','NH','CA'))
Out[30]:
<function __main__.foo>

Javascript for visualization

  • D3
  • RStudio's Shiny (hint: R from Python)
  • plotly (R pack. to plot ggplot2 objects)
  • Continuum's bokeh (python package)

Python functions generate Javascript code.
Python + R + Javascript:
trilingual before we even know it!
In[31]:
from bokeh.plotting import (figure, show,
                            ColumnDataSource,
                            output_notebook)
##from bokeh.resources import INLINE
output_notebook()

res = res.head(20)
plot = figure(x_range=list(res.rx2('state')))
source = ColumnDataSource(dict((x, tuple(res.rx2(x))) for x in res.colnames))
plot.vbar(x='state',
          bottom=0, top='n',
          width=0.5,
          color='STEELBLUE',
          source=source)
Loading BokehJS ...
Out[31]:
<bokeh.models.renderers.GlyphRenderer at 0x7f65e29465c0>

Spark

  • Resilient Distributed Dataset (RDD)
  • Abstracts the distribution of compute and data
  • Function composition optimized before evaluation

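As a loose analogy in plain Python (not Spark itself): generator expressions also compose transformations lazily, and nothing is computed until a terminal operation forces evaluation; Spark additionally optimizes the composed plan and distributes the work across a cluster.

```python
# Build a pipeline of transformations; nothing is evaluated yet.
data = range(10)
doubled = (x * 2 for x in data)                # akin to rdd.map(...)
kept = (x * x for x in doubled if x % 4 == 0)  # akin to filter + map

# Evaluation happens only at the terminal operation
# (an "action", in Spark terms).
result = sum(kept)
print(result)  # 480
```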
pyspark

Setup

In[33]:
import findspark
findspark.init()

import pyspark

conf = pyspark.conf.SparkConf()
(conf.setMaster('local[2]')
 .setAppName('ipython-notebook')
 .set("spark.executor.memory", "2g"))

sc = pyspark.SparkContext(conf=conf)

From SQL table to Spark DataFrame

In[34]:
from pyspark.sql import SQLContext, Row
sqlcontext = SQLContext(sc)
cursor.execute("SELECT * FROM location")
location = \
    sqlcontext.createDataFrame(cursor,
                               tuple(x[0] for x in cursor.description))
location.registerTempTable("location")

sql = """
SELECT * 
FROM (SELECT * FROM disease WHERE name='PNEUMONIA') AS disease
INNER JOIN casecount
ON disease.id=casecount.disease_id"""

cursor.execute(sql)
casecount = \
    sqlcontext.createDataFrame(cursor,
                               tuple(x[0] for x in cursor.description))
casecount.registerTempTable("casecount")

The data is now distributed on a Spark cluster.

Running our query

In[35]:
sql = """
SELECT state, count(city) AS ct
FROM location
GROUP BY state
ORDER BY ct DESC
"""
counts = sqlcontext.sql(sql)
  • Good old SQL
  • Translated to Spark, optimized, and to JVM bytecode

(Lazy) evaluation

In[36]:
res = counts.collect()
res[:3]
Out[36]:
[Row(state='PA', ct=75), Row(state='MA', ct=66), Row(state='OH', ct=46)]

Map-reduce with Spark

In[37]:
names = (location
         .filter(location.city.isNotNull())
         .rdd
         .flatMap(lambda rec: [x[-5:] for x in rec.city.split()])
         .map(lambda word: (word.lower(), 1))
         .reduceByKey(lambda a, b: a+b))
names.takeOrdered(10, key = lambda x: -x[1])
Out[37]:
[('ville', 26),
 ('ngton', 18),
 ('city', 16),
 ('field', 15),
 ('sburg', 13),
 ('new', 13),
 ('ester', 11),
 ('stown', 7),
 ('west', 7),
 ('st', 6)]

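Before adding SQL back in: the same suffix-counting logic can be checked locally in pure Python with a Counter. The city names below are made up for the sketch, not taken from the Tycho data.

```python
from collections import Counter

# A small, invented sample of city names (not the Tycho data).
cities = ["Springfield", "Greenville", "Nashville", "Jacksonville",
          "Morgantown", "Georgetown", "New Bedford"]

# flatMap: split each name into words, keep the last 5 characters.
suffixes = (word[-5:].lower() for city in cities for word in city.split())
# map + reduceByKey: count occurrences of each suffix.
counts = Counter(suffixes)
print(counts.most_common(3))
```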
SQL, Map-reduce with Spark

In[38]:
sql = """
SELECT city
FROM (SELECT * FROM casecount WHERE epiweek LIKE '1912%') AS sub
INNER JOIN location
ON location.id=sub.location_id
GROUP BY city
"""
y_1912 = sqlcontext.sql(sql)
names = (y_1912
         .filter(y_1912.city.isNotNull())
         .rdd
         .flatMap(lambda rec: [x[-5:] for x in rec.city.split()])
         .map(lambda word: (word.lower(), 1))
         .reduceByKey(lambda a,b: a+b))
names.takeOrdered(5, key = lambda x: -x[1])
Out[38]:
[('ville', 6), ('new', 4), ('ngton', 4), ('stown', 4), ('san', 3)]

Quartet

Python + R + Spark + SQL

(in two slides)
In[39]:
## --- SQL ---
sql = """
SELECT state, city, date_from, count AS ct
FROM (SELECT * FROM casecount WHERE epiweek LIKE '1912%') AS sub
INNER JOIN location
ON location.id=sub.location_id
"""
y_1912 = sqlcontext.sql(sql)

## --- Spark ---
cases = (y_1912
         .rdd
         .map(lambda rec: ((rec.state, int(rec.date_from.split('-')[1])),
                           rec.ct))
         .reduceByKey(lambda a,b: a+b)).collect()
In [40]:
## --- R (from Python) ---
from rpy2.robjects import StrVector, IntVector, FactorVector, Formula
months = StrVector([str(x) for x in range(1,13)])
res = dplyr.DataFrame({'state': StrVector([x[0][0] for x in cases]),
                       'month': FactorVector([x[0][1] for x in cases],
                                             levels = months),
                       'count': IntVector([x[1] for x in cases])})
dataf_plot = merge(states_map, res, all_x=True, sort=False, by="state")
dataf_plot = dplyr.DataFrame(dataf_plot).arrange('order')

jetcols = StrVector(("#00007F", "#007FFF", "#7FFF7F", "#FF7F00", "#7F0000"))
p = (gg.ggplot(dataf_plot) +
     gg.geom_polygon(gg.aes_string(x='long', y='lat',
                                   group='group', fill='count')) +
     gg.coord_map("albers", lat0=45.5, lat1=29.5) +
     gg.scale_fill_gradientn(colors=jetcols, trans='sqrt') +
     gg.facet_wrap(facets=Formula("~month")) +
     gg.ggtitle("Cases of Pneumonia in 1912"))

display(image_png(p))
In [41]:
dataf_now = dataf_plot.filter('month %in% c(9,10)')
p = (gg.ggplot(dataf_now) +
     gg.geom_polygon(gg.aes_string(x='long', y='lat',
                                   group='group', fill='count')) +
     gg.coord_map("albers", lat0=45.5, lat1=29.5) +
     gg.scale_fill_gradientn(colors=jetcols, trans='sqrt') +
     gg.facet_wrap(facets=Formula("~month")) +
     gg.ggtitle("Cases of Pneumonia in 1912"))

display(image_png(p))
In [42]:
dataf_now = (dataf_plot
             .filter('month %in% c(9,10)',
                     'state %in% c("CT", "NY", "MA", "NJ", "NH", "VT")'))
p = (gg.ggplot(dataf_now) +
     gg.geom_polygon(gg.aes_string(x='long', y='lat',
                                   group='group', fill='count')) +
     gg.coord_map("albers", lat0=45.5, lat1=29.5) +
     gg.scale_fill_gradientn(colors=jetcols, trans='sqrt') +
     gg.facet_wrap(facets=Formula("~month")) +
     gg.ggtitle("Cases of Pneumonia in 1912"))

display(image_png(p))

There would be more to cover...

  • Transpiler to C for (runtime) speed: Cython
  • JIT-compiling to LLVM for (runtime) speed: llvmlite
  • (Web) services: Thrift

In Summary

  • Data analysis needs are better covered collectively by several languages
  • Use what is already out there. Be pragmatic.