Pragmatic

Polyglot Data Analysis

 
Laurent Gautier - Big Data Bootcamp Boston 2016
(notebook as HTML | ipynb)
Disclaimer:

I am here as an individual, long-active Open Source developer and data scientist. Views and opinions expressed are solely my own.

Slides

http://lgautier.github.io/rpy2/gbdc-ppda-slides

Acknowledgments

  • Organizers and sponsors of this meeting
  • authors, contributors, and sponsors for
    • R
    • Python
    • R packages
    • Python packages
    • Docker
  • rpy2 contributors
  • rpy2 users (in the hope they become contributors)

This talk contains runnable code

The easy way:


docker run --rm -it -p 8888:8888 \
            rpy2/gbdc-ppda-slides
	    
Visit http://localhost:8888/ *

*: If using docker-machine (Windows, OS X), get the address with:


	      docker-machine ip [MACHINE]
	    

First, two questions. Who thinks that:

- mixing languages is a good idea?

- mixing languages is NOT a good idea?

Is this familiar ?


cursor = dbcon.cursor()

sql = """
SELECT reviewhelpful, count(*)
FROM review
WHERE reviewscore < 2
GROUP BY reviewhelpful
"""

cursor.execute(sql)

Hardly groundbreaking.

That's polyglot data analysis though.

You* probably have already done polyglot data analysis

(*: yes, you)

You should not feel bad about it

Data in - Insights out

bilingual

Data science: a language landscape

(alphabetical order - not comprehensive)

C C++ bash Javascript Julia Python R Scala SQL

Julia → Python: PyCall
Julia → R: Rif or RCall
[somewhere] → R: Rserve
R → Python: rPython

Polyglot data analysis: a project/ensemble with:

  • A master/glue/host language
  • One or more slave/guest languages
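
A minimal sketch of the idea (an assumption, not from the slides): Python as the host/glue language running R as the guest language, with no dedicated bridge, just a subprocess calling Rscript.

import subprocess

# Ask R to compute a mean and print it on stdout.
r_code = 'cat(mean(c(1, 2, 3, 4)), fill=TRUE)'
out = subprocess.check_output(['Rscript', '-e', r_code], text=True)
print(out)  # 2.5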

Data in tables

Detailed account of raw materials and workdays for a basketry workshop. Clay, ca. 2040 BC (Ur III).
Source: Wikipedia, Marie-Lan Nguyen (2006)

Fine foods

J. McAuley and J. Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. WWW, 2013. http://snap.stanford.edu/data/web-FineFoods.html


product/productId: B00...W
review/userId: A1S...MSMR
review/profileName: [some profile name]
review/helpfulness: 0/0
review/score: 1.0
review/time: 1344556800
review/summary: Sad outcome
review/text: Five minutes in, one tentacle was bitten off, ball inside
             cracked in half. Not durable enough to be a dog toy.
             Disappointed :(. So is the dog :(.
productid       VARCHAR
reviewerid      VARCHAR
reviewername    VARCHAR
reviewhelpful   VARCHAR
reviewscore     INTEGER
reviewtime      INTEGER
reviewsummary   VARCHAR
reviewtext      VARCHAR
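
A rough sketch of how such records could be loaded into the "review" table (an assumption: the docker image already provides /opt/data/finefoods.db pre-built; the raw file name and latin-1 encoding are guesses):

import sqlite3

# Map the raw "key: value" record fields to the table columns above.
FIELDS = {'product/productId': 'productid',
          'review/userId': 'reviewerid',
          'review/profileName': 'reviewername',
          'review/helpfulness': 'reviewhelpful',
          'review/score': 'reviewscore',
          'review/time': 'reviewtime',
          'review/summary': 'reviewsummary',
          'review/text': 'reviewtext'}

def records(lines):
    # Records are blocks of "key: value" lines separated by blank lines.
    rec = {}
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            if rec:
                yield tuple(rec.get(col) for col in FIELDS.values())
                rec = {}
            continue
        key, _, value = line.partition(': ')
        if key in FIELDS:
            rec[FIELDS[key]] = value
    if rec:
        yield tuple(rec.get(col) for col in FIELDS.values())

sql = ('INSERT INTO review (%s) VALUES (%s)'
       % (', '.join(FIELDS.values()), ', '.join('?' * len(FIELDS))))
with sqlite3.connect('/opt/data/finefoods.db') as dbcon, \
     open('finefoods.txt', encoding='latin-1') as fh:
    dbcon.executemany(sql, records(fh))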

Work with a relational database

In[1]:
import sqlite3
dbfilename = "/opt/data/finefoods.db"
dbcon = sqlite3.connect(dbfilename)
In[2]:
cursor = dbcon.cursor()
sql = """
SELECT reviewhelpful, count(*)
FROM review
WHERE reviewscore < 2
GROUP BY reviewhelpful
"""
cursor.execute(sql)
Out[2]:
<sqlite3.Cursor at 0x7f998441d1f0>

Python and SQL

SQL-in-Python
In[3]:
from collections import Counter
ct = Counter()
for row_n, (score, count) in enumerate(cursor, 1):
    ct[score] = count
print(ct.most_common(n=3))
[('0/0', 12266), ('1/1', 4809), ('0/1', 3701)]
Slightly more complicated: Distribution of helpfulness rating of reviews coming from active reviewers giving a low score.
In[5]:
sql = """
select reviewhelpful, count(*)
from (select T.reviewername
      from (select reviewername, count(*) as reviewer_count
            from review
            group by reviewername) as T
      where T.reviewer_count > 5) as U
inner join review
on U.reviewername = review.reviewername
where reviewscore < 2
group by reviewhelpful
"""
cursor.execute(sql)
Out[5]:
<sqlite3.Cursor at 0x7f998441d2d0>

Complex queries require increasingly complex SQL

⇨ So we need to know SQL, don't we ?

Object-relational mapping (ORM)

The promise of operating on objects in the host language, without writing SQL.

Python and ORMs


cursor = dbcon.cursor()
sql = """
select *
from review
where reviewscore < 2
"""
cursor.execute(sql)
result = cursor.fetchall()
		    


result = (Review
          .select()
          .where(Review.reviewscore < 2))
	    
(peewee-style example)

Set up

In[6]:
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session
from sqlalchemy import create_engine

Base = automap_base()

# engine for the SQLite database that contains the 'review' table
engine = create_engine("sqlite:////opt/data/finefoods.db")

Create classes through reflection

In[7]:
Base.prepare(engine, reflect=True)
review = Base.classes.review

Query as chained calls

In[8]:
session = Session(engine)
from sqlalchemy import func # SQL functions

query = (session
         .query(review.reviewhelpful,
                func.count(review.reviewhelpful))
         .filter(review.reviewscore < 2)
         .group_by(review.reviewhelpful)
         .order_by(func.count(review.reviewhelpful).desc()))

res = query.all()
res[:3]
Out[8]:
[('0/0', 12266), ('1/1', 4809), ('0/1', 3701)]

Chained calls generate SQL

In[9]:
from sqlalchemy.dialects import sqlite
print(str(query.statement.compile(dialect=sqlite.dialect()))
      .replace('GROUP BY', '\nGROUP BY'))
SELECT review.reviewhelpful, count(review.reviewhelpful) AS count_1 
FROM review 
WHERE review.reviewscore < ? 
GROUP BY review.reviewhelpful ORDER BY count(review.reviewhelpful) DESC
We traded knowledge of SQL for the knowledge of an API.

Back to tables

  • Model tables
  • Make most-common operations easy

dplyr

mutate      modify/add column
filter      filter rows
select      select columns
group_by    group rows
summarize   summarize groups of rows
(RStudio)

datasrc <- src_sqlite(dbfilename)
review_tbl <- tbl(datasrc, "review")


res <- filter(review_tbl,
              reviewscore < 2) %>%
       count(reviewhelpful) %>%
       arrange(desc(n))

But... we started with Python, didn't we ?


Python vs R anyone ?

Python → R !

R-in-Python

Just import the R package from Python

In[11]:
from rpy2.robjects.lib import dplyr


datasrc  = dplyr.src_sqlite(dbfilename)
review_tbl = datasrc.get_table("review")

Query as chained calls

In[12]:
res =  (review_tbl
        .filter('reviewscore < 2')
        .count('reviewhelpful')
        .arrange('desc(n)'))

print(res)
Source:   query [?? x 2]
Database: sqlite 3.8.6 [/opt/data/finefoods.db]

   reviewhelpful     n
           <chr> <int>
1            0/0 12266
2            1/1  4809
3            0/1  3701
4            1/2  2718
5            2/2  2123
6            0/2  1702
7            2/3  1579
8            1/3  1342
9            3/3  1104
10           2/4   921
# ... with more rows

While we are with R, what about figures ?

(and about R as a DSL in Python ?)
In[13]:
from rpy2.robjects import r, globalenv
import rpy2.robjects.lib.ggplot2 as gg

split_review = \
    'as.integer(sapply(strsplit(reviewhelpful, "/"), "[", %i))'

p = (gg.ggplot((res
                .collect()
                .mutate(r1 = split_review % 1,
                        r2 = split_review % 2))) +
     gg.aes_string(x='r1+1', y='r2+1', size='n') +
     gg.geom_point(alpha = 0.5) +
     gg.scale_x_log10("review voted helpful") +
     gg.scale_y_log10("review") +
     gg.scale_size(trans="sqrt"))
In [14]:
from rpy2.robjects.lib import grdevices

with grdevices.render_to_bytesio(grdevices.png, 
                                 width=800,
                                 height=600, 
                                 res=120) as b:
    p.plot()

from IPython.display import Image, display
data = b.getvalue()
display(Image(data=data, format='png', embed=True))

What about Javascript then ?

  • D3
  • RStudio's Shiny (hint: R from Python)
  • plotly (R pack. to plot ggplot2 objects)
  • Continuum's bokeh (python package)

Python functions generate Javascript code.
Python + R + Javascript: trilingual before we even know it!
In[15]:
from bokeh.plotting import figure, show, output_notebook
##from bokeh.resources import INLINE
output_notebook()

plot = figure()
res =  (review_tbl
        .filter('reviewscore < 2')
        .count('reviewhelpful')
        .collect())

import math
plot.scatter(list(int(x.split('/')[0]) \
                  for x in res.rx2('reviewhelpful')),
             list(int(x.split('/')[1]) \
                  for x in res.rx2('reviewhelpful')),
             radius=list(math.log10(x/100) for x in res.rx2('n')),
             fill_alpha=.3)
Loading BokehJS ...
Out[15]:
<bokeh.models.renderers.GlyphRenderer at 0x7f998b2fe080>
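
The cell above only returns the glyph renderer; to actually render the interactive figure inline, standard bokeh usage (not shown in the original output) is a call to show():

# Display the figure in the notebook (output_notebook() was called above).
show(plot)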

Spark

  • Resilient Distributed Dataset (RDD)
  • Abstracts the distribution of compute and data
  • Function composition optimized before evaluation
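
A minimal sketch of that last point, assuming the SparkContext sc created in the Setup cell below: transformations only compose, and nothing is computed until an action is called.

# Transformations are recorded, not executed...
squares = (sc.parallelize(range(10))
           .map(lambda x: x * x)
           .filter(lambda x: x % 2 == 0))
# ...until an action such as collect() triggers the evaluation.
squares.collect()  # [0, 4, 16, 36, 64]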

pyspark

Setup

In[17]:
import findspark
findspark.init()

import pyspark

conf = pyspark.conf.SparkConf()
(conf.setMaster('local[2]')
 .setAppName('ipython-notebook')
 .set("spark.executor.memory", "2g"))

sc = pyspark.SparkContext(conf=conf)

From SQL table to Spark DataFrame

In[18]:
from pyspark.sql import SQLContext, Row
sqlcontext = SQLContext(sc)
cursor.execute('select * from review limit 10000')
review = \
    sqlcontext.createDataFrame(cursor,
                               tuple(x[0] for x in cursor.description))
review.registerTempTable("review")

The data is now distributed on a Spark cluster.
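
A quick sanity check (not in the original notebook) that the DataFrame behaves as a distributed structure:

# Column names come from the cursor description; count() runs as a Spark job.
review.printSchema()
review.count()  # 10000 rows were loaded above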

Running our query

In[19]:
sql = """
SELECT reviewhelpful, count(*) AS ct
FROM review
WHERE reviewscore < 2
GROUP BY reviewhelpful
ORDER BY ct DESC
"""
counts = sqlcontext.sql(sql)

  • Good old SQL
  • Translated to Spark, optimized, and compiled to JVM bytecode
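
To see what that translation looks like, the DataFrame can print the plans Spark derived from the query (standard Spark API, not part of the original slides):

# Logical and physical execution plans for the SQL query above.
counts.explain(extended=True)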

(Lazy) evaluation

In[20]:
res = counts.collect()
res[:3]
Out[20]:
[Row(reviewhelpful='0/0', ct=241),
 Row(reviewhelpful='1/1', ct=87),
 Row(reviewhelpful='0/1', ct=80)]

Map-reduce with Spark

In[21]:
names = (review
         .rdd
         .flatMap(lambda p: p.reviewsummary.split())
         .map(lambda word: (word.lower(), 1))
         .reduceByKey(lambda a, b: a+b))
names.takeOrdered(10, key = lambda x: -x[1])
Out[21]:
[('great', 1259),
 ('the', 914),
 ('good', 837),
 ('for', 706),
 ('a', 695),
 ('not', 624),
 ('and', 581),
 ('best', 568),
 ('my', 451),
 ('coffee', 438)]

Map-reduce with Spark

In[22]:
from collections import Counter

bad_tokens = set(('.', '!'))

def bigrams(p):
    # Count adjacent word pairs in one review summary, keyed by reviewer.
    key = p.reviewerid
    tokens = tuple(x.lower()
                   for x in p.reviewsummary.split()
                   if x not in bad_tokens)
    ct = Counter('%s %s' % (tokens[i], tokens[i+1])
                 for i in range(len(tokens)-1))
    return ((key, k, v) for k, v in ct.items())
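
A quick check (an assumption, with a made-up reviewer id) of what bigrams() yields for a single record:

# bigrams() returns (reviewerid, word pair, count) tuples.
example = Row(reviewerid='AEXAMPLE000',
              reviewsummary='Not durable enough . Not a dog toy')
list(bigrams(example))
# [('AEXAMPLE000', 'not durable', 1), ('AEXAMPLE000', 'durable enough', 1), ...]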

Map-reduce with Spark

In[23]:
names = (review
         .rdd
         .flatMap(bigrams))
names.takeOrdered(10, key = lambda x: -x[2])
Out[23]:
[('A1DTDSL1J6C05F', 'love love', 2),
 ('A2LS9RBSI8SDZF', 'can not', 2),
 ('A2A1XYSB692L6J', 'a r', 2),
 ('A2A1XYSB692L6J', 'h i', 2),
 ('A212N0TUR007EK', "'made in", 2),
 ('ABTUNH7645QJL', 'a lot', 2),
 ('ABTUNH7645QJL', 'lot of', 2),
 ('ABTUNH7645QJL', 'not a', 2),
 ('A1QEZ01P4PH82D', 'hot hot', 2),
 ('A2U51T1TK6CYX9', '2 different', 2)]

Map-reduce with Spark

In[24]:
names = (review
         .rdd
         .flatMap(bigrams)
         .map(lambda x: (x[1], x[2]))
         .reduceByKey(lambda a,b: a+b))
names.takeOrdered(15, key = lambda x: -x[1])
Out[24]:
[('the best', 208),
 ('great for', 88),
 ('but not', 86),
 ('great product', 84),
 ('love this', 75),
 ('my favorite', 73),
 ('very good', 73),
 ('a great', 71),
 ('this is', 70),
 ('of the', 69),
 ('good for', 62),
 ('gluten free', 58),
 ('for the', 57),
 ('better than', 56),
 ('dog food', 55)]

SQL, Map-reduce with Spark

In[25]:
sql = """
SELECT *
FROM review
WHERE reviewscore < 2
"""
lowscore = sqlcontext.sql(sql)
names = (lowscore
         .rdd
         .flatMap(bigrams)
         .map(lambda x: (x[1], x[2]))
         .reduceByKey(lambda a,b: a+b))
names.takeOrdered(5, key = lambda x: -x[1])
Out[25]:
[('not good', 16),
 ('do not', 12),
 ('in the', 12),
 ('the worst', 11),
 ('waste of', 10)]

The Finale

Python + R + Spark + SQL
In [26]:
## --- SQL ---
sql = "SELECT * FROM review WHERE reviewscore < 2"
lowscore = sqlcontext.sql(sql)
## --- Spark ---
names = (lowscore
         .rdd
         .flatMap(bigrams)
         .map(lambda x: (x[1], x[2]))
         .reduceByKey(lambda a,b: a+b))
lst = names.takeOrdered(20, key = lambda x: -x[1])
## --- R (from Python) ---
from rpy2.robjects.vectors import StrVector, IntVector
dataf = dplyr.DataFrame({'word_pair': StrVector([x[0] for x in lst]),
                         'count': IntVector([x[1] for x in lst])})
p = (gg.ggplot(dataf.arrange('desc(count)')) +
     gg.geom_bar(gg.aes_string(x='factor(word_pair, levels=word_pair)',
                               y='count'),
                 stat="identity") +
     gg.scale_x_discrete('word pair') +
     gg.theme(**{'axis.text.x': gg.element_text(angle = 90, hjust = 1)}))

with grdevices.render_to_bytesio(grdevices.png, 
                                 width=800, height=600, 
                                 res=120) as b:
    p.plot()
display(Image(data=b.getvalue(), format='png', embed=True))

There would be more to cover...

  • Transpiler to C for (runtime) speed: Cython
  • JIT-compiling to LLVM for (runtime) speed: llvmlite
  • (Web) services: Thrift

In Summary

  • Data analysis needs are better covered collectively by several languages
  • Use what is already out there. Be pragmatic.