class: center, middle, inverse, title-slide

# Reticulate:
## a tale of two languages
### Antonio Alvarez
### 07.11.2020

---

layout: false
class: bg-lime split-30 hide-slide-number

.column[
]
.column.slide-in-right[.content.vmiddle[
.sliderbox.shade_main.pad1[
.font5[Welcome!]
]
]]

---

class: hide-slide-number

# .teal[The world is changing, and it will change even more...]

<br>

### - Data applications and data analysis are constantly evolving

### - For analysts, it is important to be familiar with the proper practices in their field

### - The problem _always_ comes before the solutions

---

class: middle center bg-lime hide-slide-number

# Which one do I choose?

<img src="img/larry.gif", width="60%">

---

layout: true
class: split-two with-thick-border border-white hide-slide-number

.column.bg-teal[.content[
<br>
# .white[.center[Why Python?]]
<br>
### - General-purpose language with uses beyond data analysis
### - Ease of deploying applications and software products
### - High readability among programming languages
<br>
.center[
<img src="img/py_logo.png", width="40%">
]
]]
.column.bg-light-blue[.content[
<br>
# .white[.center[Why R?]]
<br>
### - Language designed specifically for statistical analysis
### - Vectorization is part of its design, i.e. linear algebra
### - Large number of packages for specialized knowledge
<br>
.center[
<img src="img/r_logo.png", width="50%">
]
]]

---

class: hide-col2
count:false

---

class:hide-col1
count:false

---

layout: false
class: middle center bg-lime hide-slide-number

<img src="img/dos.gif", width="60%">

---

class: bg-main6 hide-slide-number

# .center[Reticulate 🐍]

<br>

### - Package that connects the Python environment with R and vice versa

### - Objects and functions can be translated and used by both languages

### - The Python version to use is configurable

### - R Markdown documents, Jupyter notebooks: all with high interoperability

.center[
<img src="img/r_py.png", width="55%">
]

---

class: split-two with-thick-border border-white hide-slide-number

.column.bg-teal[
.split-three[
.row[.content.vmiddle[
]]
.row[.content.vmiddle[
]]
.row[.content.vmiddle[
]]
]
]
.column.bg-light-blue[.content.vmiddle.center[
# .white[3 simple steps!]
]]

---

class: split-two with-thick-border border-white hide-slide-number

.column.bg-teal[
.split-three[
.row[.content.vmiddle[
### - Install R and RStudio
]]
.row[.content.vmiddle[
]]
.row[.content.vmiddle[
]]
]
]
.column.bg-light-blue[.content.vmiddle.center[
# .white[3 simple steps!]
]]

---

class: split-two with-thick-border border-white hide-slide-number

.column.bg-teal[
.split-three[
.row[.content.vmiddle[
### - Install R and RStudio
]]
.row[.content.vmiddle[
### - Install `reticulate`
]]
.row[.content.vmiddle[
]]
]
]
.column.bg-light-blue[.content.vmiddle[

```r
install.packages("reticulate")
library(reticulate)
```

]]

---

class: split-two with-thick-border border-white hide-slide-number

.column.bg-teal[
.split-three[
.row[.content.vmiddle[
### - Install R and RStudio
]]
.row[.content.vmiddle[
### - Install `reticulate`
]]
.row[.content.vmiddle[
### - Install `tidyverse` (recommended)
]]
]
]
.column.bg-light-blue[.content.vmiddle[

```r
install.packages("tidyverse")
library(tidyverse)
```

]]

---

class: bg-main6 hide-slide-number

# Configuring reticulate

### - If it does not detect a Python version, `reticulate` offers its own Python installation, which comes with Conda

### - We can set a global environment variable in the .Renviron file, located in the Documents directory

```r
RETICULATE_PYTHON="C:/Users/Tukey/AppData/Local/Programs/Python/Python38/python.exe"
```

### - This Python version will be used as the default

### - Each project can also have its own Python version

---

class: bg-teal

# .white[Data conversion]

.center[
<img src="img/tipos.png", width="100%">
]

---

class: middle hide-slide-number

# .green[Calling Python]

<br>
<br>

## 1. Python in R Markdown 📚
<br>
## 2. Importing Python modules 🏤
<br>
## 3. Calling Python scripts 🗳
<br>
## 4. Access to the Python REPL 📟

---

layout: false
class: bg-lime split-30 hide-slide-number

.column[
]
.column.slide-in-right[.content.vmiddle[
.sliderbox.shade_main.pad1[
.font5[Example: Parsing in R, modeling with scikit-learn]
]
]]

---

class: split-two hide-slide-number

.column.bg-light-blue[.content[
<br>
# .white[A quick introduction to R Markdown]
<br>
<br>
<br>
### The Jupyter of R
### Uses snippets (code chunks) to combine code with written text
### Can generate many kinds of output: from PDFs and PowerPoint decks to web pages and dashboards
]]
.column.bg-black[.content.vmiddle[
<img src="img/snip_r.png", width="80%">
<br>
### - .red[Language]
### - .green[Snippet name]
### - .orange[knitr output options]
<br>
<img src="img/snip_py.png", width="80%">
]]

---

class: split-70 with-thick-border border-black hide-slide-number

.column.bg-indigo[.content[

```r
library(tidyverse)
library(recipes)
library(reticulate)
# Optional: point reticulate at a different interpreter (placeholder path)
#use_python("ruta/a-otra/ver/python.exe")
df <- readxl::read_xls("data/sleep.xls")
```

```
## [1] 445  53
```

```
## ID                        Age_Group                 Sex
## "character"               "double"                  "character"
## Total_years_dispatcher    Total_years_present_job   Job_type
## "double"                  "double"                  "double"
## Marital_Status            Childrendependents        Children_under_2_yrs
## "character"               "double"                  "double"
## Caff_Beverages            If_yes_how_many_daily     Sick_Days_in_last_year
## "double"                  "double"                  "double"
## Health_status             Older_Younger_or_Same     Diagnosed_Sleep_disorder
## "double"                  "double"                  "double"
## Sleep_Apnea               Medical_treatment         SunStart_Time
## "double"                  "double"                  "double"
## MonStart_Time             TuesStart_Time            WedStart_Time
## "double"                  "double"                  "double"
## ThuStart_Time             FriStart_Time             SatStart_Time
## "double"                  "double"                  "double"
## SunEnd_Time               MonEnd_Time               TuesEnd_Time
## "double"                  "double"                  "double"
## WedEnd_Time               ThuEnd_Time               FriEnd_Time
## "double"                  "double"                  "double"
## SatEnd_Time               job_schedule              Relief_schedule
## "double"                  "double"                  "character"
## Avg_Work_Hrs_Week
## "double"
```

]]
.column.bg-cyan[.content.vmiddle.center[
# Set-up
]]
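---

class: bg-main6 hide-slide-number

# Which Python is running?

### - A minimal sketch, not part of the original analysis: before running Python chunks it can help to confirm which interpreter was picked up, for example after setting `RETICULATE_PYTHON` or calling `use_python()`

### - The helper name `interpreter_info` is hypothetical; it only uses the standard library, and the values it prints depend entirely on the local setup

```python
import platform
import sys


def interpreter_info():
    """Describe the Python interpreter that is executing this chunk."""
    return {
        "executable": sys.executable,          # path reported for the interpreter binary
        "prefix": sys.prefix,                  # root of the active environment
        "version": platform.python_version(),  # e.g. major.minor.patch
    }


info = interpreter_info()
for key, value in info.items():
    print(f"{key}: {value}")
```

### - If the reported environment is not the expected one, the interpreter has to be selected before Python is first initialized in the session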
---

class: split-70 with-thick-border border-black hide-slide-number

.column.bg-indigo[.content[

```r
library(tidyverse)
library(recipes)
library(reticulate)
# Optional: point reticulate at a different interpreter (placeholder path)
#use_python("ruta/a-otra/ver/python.exe")
df <- readxl::read_xls("data/sleep.xls")
```

```r
sleep <- df %>%
  select(
    Diagnosed_Sleep_disorder, Age_Group, Sex, Total_years_dispatcher,
    Total_years_present_job, Marital_Status, Childrendependents,
    Children_under_2_yrs, Caff_Beverages, Sick_Days_in_last_year,
    Health_status, Avg_Work_Hrs_Week, FRA_report, Phys_Drained,
    Mentally_Drained, Alert_at_Work, Job_Security
  ) %>%
  rename_all(tolower) %>%
  mutate_if(is.character, as.numeric) %>%
  # Recode values 1/2 as 1/0
  mutate_at(vars(diagnosed_sleep_disorder, sex, caff_beverages, fra_report), ~ -(. - 2)) %>%
  # Shift marital_status down by one
  mutate_at(vars(marital_status), ~ (. - 1)) %>%
  drop_na()
```

]]
.column.bg-cyan[.content.vmiddle.center[
# Parsing
]]

---

class: split-70 with-thick-border border-black hide-slide-number

.column.bg-indigo[.content[

```r
numeric_variables <- c(
  "total_years_dispatcher", "total_years_present_job",
  "childrendependents", "children_under_2_yrs",
  "sick_days_in_last_year", "avg_work_hrs_week"
)
factor_variables <- setdiff(colnames(sleep), numeric_variables)
sleep <- mutate_at(sleep, vars(factor_variables), as.factor)

# 75% / 25% train/test split
set.seed(2001)
index <- sample(1:nrow(sleep), floor(nrow(sleep) * .75))
sleep_train <- sleep[index, ]
sleep_test <- sleep[-index, ]
```

]]
.column.bg-cyan[.content.vmiddle.center[
# Preparation
]]

---

class: split-70 with-thick-border border-black hide-slide-number

.column.bg-indigo[.content[

```r
numeric_variables <- c(
  "total_years_dispatcher", "total_years_present_job",
  "childrendependents", "children_under_2_yrs",
  "sick_days_in_last_year", "avg_work_hrs_week"
)
factor_variables <- setdiff(colnames(sleep), numeric_variables)
sleep <- mutate_at(sleep, vars(factor_variables), as.factor)

# 75% / 25% train/test split
set.seed(2019)
index <- sample(1:nrow(sleep), floor(nrow(sleep) * .75))
sleep_train <- sleep[index, ]
sleep_test <- sleep[-index, ]
```

```r
recipe_formula <- recipe(diagnosed_sleep_disorder ~ ., sleep_train)

# One-hot encode the factors, downsample on the outcome, standardize the numeric columns
recipe_steps <- recipe_formula %>%
  step_dummy(factor_variables, -all_outcomes(), one_hot = TRUE) %>%
  themis::step_downsample(diagnosed_sleep_disorder) %>%
  step_center(numeric_variables) %>%
  step_scale(numeric_variables)

recipe_prep <- prep(recipe_steps, sleep_train, retain = TRUE)

training_data <- juice(recipe_prep)
testing_data <- bake(recipe_prep, sleep_test)
```

]]
.column.bg-cyan[.content.vmiddle.center[
# Recipe
]]

---

class: middle center bg-lime hide-slide-number

<img src="img/get-on.gif", width="60%">

---

class: split-30 with-thick-border border-black hide-slide-number

.column.bg-lime[.content.vmiddle.center[
# Set-up
]]
.column.bg-teal[.content[

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
```

]]

---

class: split-30 with-thick-border border-black hide-slide-number

.column.bg-lime[.content.vmiddle.center[
# Split
]]
.column.bg-teal[.content[

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
```

```python
# The `r` object exposes objects from the R session to Python
y_train = r.training_data['diagnosed_sleep_disorder']
X_train = r.training_data.drop('diagnosed_sleep_disorder', axis = 1)
y_test = r.testing_data['diagnosed_sleep_disorder']
X_test = r.testing_data.drop('diagnosed_sleep_disorder', axis = 1)
```

]]

---

class: split-30 with-thick-border border-black hide-slide-number

.column.bg-lime[.content.vmiddle.center[
# Tuning
]]
.column.bg-teal[.content[

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
```

```python
# The `r` object exposes objects from the R session to Python
y_train = r.training_data['diagnosed_sleep_disorder']
X_train = r.training_data.drop('diagnosed_sleep_disorder', axis = 1)
y_test = r.testing_data['diagnosed_sleep_disorder']
X_test = r.testing_data.drop('diagnosed_sleep_disorder', axis = 1)
```

```python
# SVM hyperparameter grid; 'gamma' has no effect with the linear kernel
param_grid = [{'C': [0.01, 0.1, 1, 10, 100],
               'gamma': [0.001, 0.01, 0.1, 1, 10],
               'kernel': ['rbf', 'linear']}]
grid = GridSearchCV(svm.SVC(), param_grid, cv = 5, scoring = 'balanced_accuracy')
grid.fit(X_train, y_train)
```

```
## GridSearchCV(cv=5, estimator=SVC(),
##              param_grid=[{'C': [0.01, 0.1, 1, 10, 100],
##                           'gamma': [0.001, 0.01, 0.1, 1, 10],
##                           'kernel': ['rbf', 'linear']}],
##              scoring='balanced_accuracy')
```

]]

---

class: split-30 with-thick-border border-black hide-slide-number

.column.bg-lime[.content.vmiddle.center[
# Best parameters
]]
.column.bg-teal[.content[

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
```

```python
# The `r` object exposes objects from the R session to Python
y_train = r.training_data['diagnosed_sleep_disorder']
X_train = r.training_data.drop('diagnosed_sleep_disorder', axis = 1)
y_test = r.testing_data['diagnosed_sleep_disorder']
X_test = r.testing_data.drop('diagnosed_sleep_disorder', axis = 1)
```

```python
# SVM hyperparameter grid; 'gamma' has no effect with the linear kernel
param_grid = [{'C': [0.01, 0.1, 1, 10, 100],
               'gamma': [0.001, 0.01, 0.1, 1, 10],
               'kernel': ['rbf', 'linear']}]
grid = GridSearchCV(svm.SVC(), param_grid, cv = 5, scoring = 'balanced_accuracy')
grid.fit(X_train, y_train)
```

```python
print(grid.best_params_)
```

```
## {'C': 0.01, 'gamma': 0.001, 'kernel': 'rbf'}
```

]]

---

class: bg-teal hide-slide-number

# .white[Confusion Report]

```python
clf = grid.best_estimator_
y_pred = clf.predict(X_test)
# Confusion matrix: rows are observed classes, columns are predicted classes
print('Matriz de la Confusión:\n\n', confusion_matrix(y_test, y_pred))
```

```
## Matriz de la Confusión:
## 
##  [[67 27]
##  [ 6  4]]
```

```python
# Per-class precision, recall and F1
print('\nReporte de clasificación:\n\n', classification_report(y_test, y_pred))
```

```
## 
## Reporte de clasificación:
## 
##                precision    recall  f1-score   support
## 
##            0       0.92      0.71      0.80        94
##            1       0.13      0.40      0.20        10
## 
##     accuracy                           0.68       104
##    macro avg       0.52      0.56      0.50       104
## weighted avg       0.84      0.68      0.74       104
```

---

class: bg-teal hide-slide-number

# .white[Confusion Report]

```python
# clf.score() returns the mean accuracy on the given data
print('\nPrecisión training: {:.2f}'.format(clf.score(X_train, y_train)))
```

```
## 
## Precisión training: 0.73
```

```python
print('\nPrecisión test: {:.2f}'.format(clf.score(X_test, y_test)))
```

```
## 
## Precisión test: 0.68
```

### .white[We still have room to improve...]

---

class: split-two hide-slide-number

.column.bg-teal[.content[
<br>
# .white[Let's plot the confusion]
<br>

```python
conf_mat = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_mat, square = True, annot = True, fmt = 'g', cbar = False, cmap = 'viridis')
plt.xlabel('Predicho')   # predicted
plt.ylabel('Observado')  # observed
plt.show()
```

]]
.column.bg-lime[.content.vmiddle[
<img src="index_files/figure-html/unnamed-chunk-24-1.png" width="576" />
]]

---

class: middle center bg-lime hide-slide-number

# We did it!
<img src="img/prettygood.gif", width="60%">

---

class: split-two with-thick-border border-white hide-slide-number

.column.bg-teal[
.split-three[
.row[.content.vmiddle[
### .white[We can bring Python objects into R]
]]
.row[.content.vmiddle[
]]
.row[.content.vmiddle[
]]
]
]
.column.bg-light-blue[.content[
<br>
# Other details
<br>

```r
py$conf_mat
```

```
##      [,1] [,2]
## [1,]   67   27
## [2,]    6    4
```

]]

---

class: split-two with-thick-border border-white hide-slide-number

.column.bg-teal[
.split-three[
.row[.content.vmiddle[
### .white[We can bring Python objects into R]
]]
.row[.content.vmiddle[
### .white[We can import Python modules and use their functions]
]]
.row[.content.vmiddle[
]]
]
]
.column.bg-light-blue[.content[
<br>
# Other details
<br>

```r
tf <- import("tensorflow")
mujeres <- tf$constant(150, name = "Empleadas")
hombres <- tf$constant(135, name = "Empleados")
total <- tf$add(mujeres, hombres)
total
```

```
## tf.Tensor(285.0, shape=(), dtype=float32)
```

]]

---

class: split-two with-thick-border border-white hide-slide-number

.column.bg-teal[
.split-three[
.row[.content.vmiddle[
### .white[We can bring Python objects into R]
]]
.row[.content.vmiddle[
### .white[We can import Python modules and use their functions]
]]
.row[.content.vmiddle[
### .white[We can use the R console to run the Python REPL]
]]
]
]
.column.bg-light-blue[.content[
<br>
# Other details
.center[
<img src="img/repl_py.png", width="77%">
]
]]

---

layout: false
class: bg-lime split-30 hide-slide-number

.column[
]
.column.slide-in-right[.content.vmiddle[
.sliderbox.shade_main.pad1[
.font5[Thank you!]
]
]]

---

class: bg-main6 hide-slide-number

# .orange[Acknowledgments]

<br>

## - To the Sociedad Ecuatoriana Estadística

## - To Yihui Xie and Emi Tanaka for the `xaringan` package and the `kunoichi` theme

## - To the RStudio contributors for developing `reticulate` 🐍

.center[
<img src="img/thank.gif", width="30%">
]
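
---

class: bg-main6 hide-slide-number

# Appendix: reading the confusion matrix

### - A small sketch, added for illustration, of how the figures in the classification report follow from the confusion matrix shown earlier (rows: observed, columns: predicted)

### - The helper name `per_class_metrics` is hypothetical; the matrix values are the ones reported in the slides, and only plain Python is used

```python
# Confusion matrix reported in the slides (rows: observed, columns: predicted)
conf_mat = [[67, 27],
            [6, 4]]


def per_class_metrics(cm):
    """Return per-class precision and recall, plus accuracy and balanced accuracy.

    Assumes every class is predicted at least once and observed at least once,
    so no column or row sum is zero.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    # Precision: correct predictions of a class over everything predicted as that class
    precision = [cm[i][i] / sum(cm[r][i] for r in range(n)) for i in range(n)]
    # Recall: correct predictions of a class over everything observed in that class
    recall = [cm[i][i] / sum(cm[i]) for i in range(n)]
    accuracy = correct / total
    # Balanced accuracy: unweighted mean of the per-class recalls
    balanced_accuracy = sum(recall) / n
    return precision, recall, accuracy, balanced_accuracy


precision, recall, accuracy, balanced_accuracy = per_class_metrics(conf_mat)
print([round(p, 2) for p in precision])
print([round(r, 2) for r in recall])
print(round(accuracy, 2), round(balanced_accuracy, 2))
```

### - Rounded to two decimals, these values agree with the report: class 1 has a recall of 0.40 but a precision of only 0.13, which is why an overall accuracy of 0.68 still leaves room to improve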