p-Values

In recent years, rising concern about the proper use of p-values has spread through the scientific community. Misuse of this measure has led to biased results and conclusions, prompting the community to review the correct application of the metric and, in more extreme cases, to discourage its use entirely. The ASA (American Statistical Association), in an effort to delimit its scope and limitations, released less than a year ago “The ASA’s Statement on P-values: Context, Process and Purpose”.[1]

In many papers across a very wide community of scientists, statements such as “With a significance of 99.95%, we conclude…” became widely accepted, even when the methods and conclusions were questionable or insufficient to support such strong claims. As the article recalls, a discussion about the metric was summarized in the following anecdotal quote: “Why do we use p-values? Because we were taught to use them. Why were we taught to use them? Because it’s what scientists use.” This led to a great need for principles and non-technical descriptions that could help scientists use statistical tools properly.

There are six key principles addressed by the statement:

  • p-Values can indicate how incompatible the data are with a specified statistical model.

    A p-value provides one measure of the compatibility between a particular set of data and a proposed model for the data, for instance the null hypothesis: the smaller the p-value, the greater the incompatibility of the data with the null hypothesis. This can provide evidence against the null hypothesis or against the underlying assumptions.

  • p-Values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone

    One of the common mistakes researchers make is to read the p-value as a measure of the truth of the null hypothesis, or as the probability that random chance produced the observed data. p-Values are neither: a p-value is a statement about the relationship between the data and a specified hypothetical explanation.

  • Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold

    The common claim of “statistical significance” (p ≤ 0.05) may distort reality, and when it is presented without contextual information about the methods or the data-collection procedures, it may lead to wrong or imprecise conclusions. By themselves, p-values cannot assure that some conclusion or policy is correct just because a significance threshold was passed.

  • Proper inference requires full reporting and transparency

    Cherry-picking promising results, and practices such as data dredging, significance chasing, and selective inference, should be vigorously avoided.

  • A p-Value, or statistical significance, does not measure the size of an effect or the importance of the result

    Quoting a “95%”, “99.95%”, or “99.99%” significance in a statement may wrongly suggest that this is a measure of the importance of the study. However, statistical significance is not equivalent to scientific, human, or economic significance. For example, small p-values do not necessarily imply the presence of larger or more important effects: any effect, no matter how tiny, can produce a small p-value (see the short simulation after this list). It is also important to note that small sample sizes or imprecise measurements can lead to either small or large p-values.

  • By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis

    A p-value by itself provides poor and limited information. The model in consideration may be inadequate for the data being studied; a small p-value can then be wrongly taken as good evidence for the hypothesis, when other models are more accurate or more consistent with the data in consideration.
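
To make principles 1 and 5 concrete, here is a small simulation (my own sketch, not part of the ASA statement; it assumes NumPy and SciPy are available). A negligible effect yields an arbitrarily small p-value once the sample is large enough:

import numpy as np
from scipy import stats

# Draw a huge sample whose true mean is only 0.01 standard deviations
# away from the null value of 0.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.01, scale=1.0, size=1_000_000)

# Test H0: mean == 0. The p-value comes out astronomically small,
# even though the effect itself is minuscule.
t, p = stats.ttest_1samp(sample, popmean=0.0)
print(t, p)

The p-value here measures incompatibility with the null model, not the size or the importance of the effect.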

These very simple but profound principles aim to improve and clarify the statistical usage of p-values, or at least help researchers avoid presumptuous language when using them alone. This is also a good call to the scientific community to always rely on the scientific method and not to take shortcuts, such as using p-values inadequately and in isolation to prove hypotheses. In conclusion, as the article itself finishes: “No single index should substitute for scientific reasoning.”

Resources:

[1] R. L. Wasserstein & N. A. Lazar (2016), “The ASA’s Statement on p-Values: Context, Process, and Purpose,” The American Statistician, 70:2, 129-133.

Watershed Transformation:

The watershed is a classical algorithm used for segmentation, that is, for separating different objects in an image.

Starting from user-defined markers, the watershed algorithm treats pixel values as a local topography (elevation). The algorithm floods basins from the markers until basins attributed to different markers meet on watershed lines. In many cases, markers are chosen as local minima of the image, from which basins are flooded.

Advantages:

  • Simplicity
  • Speed
  • Complete division of the image
  • Closed contours

Drawbacks:

  • Oversegmentation
  • Sensitivity to noise
  • Poor detection of important areas with low-contrast boundaries

Algorithms:

  • Pengi: multidegree immersion simulation, based on Vincent and Soille.

  • Hsieh: aimed at small moving-object detection; first, a noise-removal technique is applied.

  • Frucci and Baja: look for homogeneous regions, with successive assignment of the regions to “foreground” and “background”.

  • Hamarneh and Li: prior shape and appearance knowledge is applied, then clustering and merging to reduce the oversegmentation.

  • Cheng: combines this technique with graph-based segmentation.

  • Smouvi and Masmoudi: introduce a histogram-based approach.

  • Zanaty and Afifi: presented seeded region growing with image entropy.
import numpy as np
import matplotlib.pyplot as plt
from scipy import ndimage as ndi

from skimage.segmentation import watershed
from skimage.feature import peak_local_max

# Generate an initial image with two overlapping circles
x, y = np.indices((80, 80))
x1, y1, x2, y2 = 28, 28, 44, 52
r1, r2 = 16, 20
mask_circle1 = (x - x1)**2 + (y - y1)**2 < r1**2
mask_circle2 = (x - x2)**2 + (y - y2)**2 < r2**2
image = np.logical_or(mask_circle1, mask_circle2)
# Now we want to separate the two objects in image
# Generate the markers as local maxima of the distance to the background
distance = ndi.distance_transform_edt(image)
local_maxi = peak_local_max(distance, indices=False, footprint=np.ones((3, 3)),
                            labels=image)
markers = ndi.label(local_maxi)[0]
labels = watershed(-distance, markers, mask=image)

fig, axes = plt.subplots(ncols=3, figsize=(9, 3), sharex=True, sharey=True)
ax = axes.ravel()

ax[0].imshow(image, cmap=plt.cm.gray, interpolation='nearest')
ax[0].set_title('Overlapping objects')
ax[1].imshow(-distance, cmap=plt.cm.gray, interpolation='nearest')
ax[1].set_title('Distances')
ax[2].imshow(labels, cmap=plt.cm.spectral, interpolation='nearest')
ax[2].set_title('Separated objects')

for a in ax:
    a.set_axis_off()

fig.tight_layout()
plt.show()

png


from skimage.segmentation import watershed
from skimage.data import camera

# Markers can also be placed directly from gray levels:
# dark pixels seed one basin, bright pixels the other.
image = camera()
markers = np.zeros_like(image)
markers[image < 30] = 1
markers[image > 150] = 2
segmented = watershed(image, markers)
plt.imshow(segmented)

png


Region Growing

This family of techniques originated in MRI analysis.

  • Adams and Bischof:

Seeded region growing. A first-order dependency arises whenever several pixels have the same difference to their neighbors, and a second-order dependency when one pixel has the same difference measured to several regions.
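
As a minimal sketch of this idea (my own simplification, not Adams and Bischof's exact algorithm), pixels adjacent to a region can be absorbed in order of their gray-level difference to the region's running mean:

import numpy as np
from heapq import heappush, heappop

def seeded_region_growing(image, seeds):
    # seeds: dict mapping (row, col) -> region label (an integer >= 1)
    labels = np.zeros(image.shape, dtype=int)
    sums, counts = {}, {}  # running intensity sum and pixel count per region
    heap = []

    def push_neighbors(r, c, lab):
        # Queue unlabeled 4-neighbors, keyed by distance to the region mean
        # (computed at push time, for simplicity).
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if (0 <= rr < image.shape[0] and 0 <= cc < image.shape[1]
                    and labels[rr, cc] == 0):
                delta = abs(float(image[rr, cc]) - sums[lab] / counts[lab])
                heappush(heap, (delta, rr, cc, lab))

    for (r, c), lab in seeds.items():
        labels[r, c] = lab
        sums[lab] = sums.get(lab, 0.0) + float(image[r, c])
        counts[lab] = counts.get(lab, 0) + 1
    for (r, c), lab in seeds.items():
        push_neighbors(r, c, lab)

    while heap:
        delta, r, c, lab = heappop(heap)
        if labels[r, c] != 0:
            continue  # already claimed by some region
        labels[r, c] = lab
        sums[lab] += float(image[r, c])
        counts[lab] += 1
        push_neighbors(r, c, lab)
    return labels

The ties in that priority queue are exactly where the two dependencies above appear: several pixels can share the same difference, and one pixel can have the same difference to several regions.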

Clustering:

A cluster is a collection of objects which are “similar”. To define similarity, we can use similarity measures such as spatial neighborhoods.
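
As a rough illustration of clustering-based segmentation (plain k-means on gray values standing in for the fuzzy c-means variants listed below; this assumes scikit-learn is installed):

import numpy as np
from sklearn.cluster import KMeans
from skimage.data import camera

image = camera()
pixels = image.reshape(-1, 1).astype(float)  # one feature per pixel: its gray value

# Cluster the gray values into 3 groups and map the labels back to the image shape
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(pixels)
segmentation = kmeans.labels_.reshape(image.shape)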

Drawbacks:

  • Time
  • Complexity
  • Unwanted smoothing.

Algorithms:

  • Ahmed: developed a neighborhood-averaging additive term, and a bias-corrected algorithm called BCFCM.

  • Liew and Yan: looked at the spatial constraint and modeled the inhomogeneity with B-spline surfaces. A dissimilarity index is used for checking voxel connectivity.

  • Szilagyi: used the EMFCM algorithm to accelerate the image segmentation process.

Thresholding:

This technique is one of the simplest in image segmentation: it sets each pixel to black or white depending on how it compares with some fixed value. Several techniques can be used to pick this value:

  • Histogram Shape
  • Clustering
  • Entropy
  • Object Attributes
  • Spatial distribution
  • Local changes of a region.

Some known algorithms and their results are:

Algorithms:

Otsu’s:

After calculating a histogram, the distribution of the gray values of the image is analyzed, and the minimum between the two peaks of a bimodal histogram is chosen as the threshold. One of its drawbacks is that a single global threshold generally cannot give a good segmentation.
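
To see the histogram analysis concretely, here is a minimal sketch of Otsu's criterion: it chooses the threshold that maximizes the between-class variance, which for a bimodal histogram lands in the valley between the two peaks.

import numpy as np

def otsu_threshold(image):
    # Histogram of an 8-bit grayscale image, normalized to probabilities
    hist, _ = np.histogram(image, bins=256, range=(0, 256))
    p = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = p[:t].sum(), p[t:].sum()  # class weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * p[:t]).sum() / w0        # mean of the dark class
        mu1 = (np.arange(t, 256) * p[t:]).sum() / w1   # mean of the bright class
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t

The threshold_otsu function from scikit-image, used below, implements the same idea.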

import matplotlib
import matplotlib.pyplot as plt

from skimage.data import camera
from skimage.filters import threshold_otsu, threshold_li

matplotlib.rcParams['font.size'] = 9


image = camera()
thresh = threshold_otsu(image)
binary = image > thresh

fig = plt.figure(figsize=(8, 2.5))
ax1 = plt.subplot(1, 3, 1)
ax2 = plt.subplot(1, 3, 2)
ax3 = plt.subplot(1, 3, 3, sharex=ax1, sharey=ax1)

ax1.imshow(image, cmap=plt.cm.gray)
ax1.set_title('Original')
ax1.axis('off')

# Histogram of gray values, with the chosen threshold marked in red
ax2.hist(image.ravel(), bins=256)
ax2.set_title('Histogram')
ax2.axvline(thresh, color='r')

ax3.imshow(binary, cmap=plt.cm.gray)
ax3.set_title('Thresholded')
ax3.axis('off')

plt.show()

png

%matplotlib inline
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from skimage.data import camera

def plot(thresh):
    # Threshold the camera image at the given value and show the result
    image = camera()
    binary = image > thresh
    fig = plt.figure(figsize=(8, 2.5))
    ax1 = plt.subplot(1, 3, 1)
    ax2 = plt.subplot(1, 3, 2)
    ax3 = plt.subplot(1, 3, 3, sharex=ax1, sharey=ax1)

    ax1.imshow(image, cmap=plt.cm.gray)
    ax1.set_title('Original')
    ax1.axis('off')

    ax2.hist(image.ravel(), bins=256)
    ax2.set_title('Histogram')
    ax2.axvline(thresh, color='r')

    ax3.imshow(binary, cmap=plt.cm.gray)
    ax3.set_title('Thresholded')
    ax3.axis('off')

    plt.show()

Here is an example of how to add an interactive slider for changing the threshold in a Jupyter notebook.

from ipywidgets import interact, IntSlider

interact(plot, thresh=IntSlider(min=0, max=300))

png


Li:

Proposes to use a 2D histogram.

threshli = threshold_li(image)
plot(thresh)    # Otsu's threshold, for comparison
plot(threshli)  # Li's threshold

png

png

Composite Methods

  • Huang and Chau:

Proposed to first compute a Gaussian mixture model via EM, and then take the threshold as the average of the means.

  • Maitra and Chatterjee:

Bacterial foraging.

  • Rogowska:

Proposed to divide an image into three groups relative to a threshold:

      • Less than
      • Equal to
      • Greater than

Some preprocessing might be required.
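
A minimal sketch of this three-way grouping, assuming a fixed, hypothetical threshold t (Rogowska's actual procedure is not reproduced here):

import numpy as np
from skimage.data import camera

image = camera()
t = 128  # hypothetical threshold

groups = np.zeros_like(image, dtype=np.uint8)
groups[image < t] = 0   # "less than" group
groups[image == t] = 1  # "equal to" group
groups[image > t] = 2   # "greater than" group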
  • Zhang:

The first automatic, fast, robust, and accurate segmentation method for bone, using 3D adaptive thresholding, essentially differentiating bone from non-bone voxels. It uses iterative 3D correlation to validate voxel classification.

  • Vijay:

Used an adaptive filter, reducing the noise but adding some blur. Then, in a second step, adaptive wavelet thresholding was applied with a multiscale product rule.

from skimage.data import camera
from skimage.filters import try_all_threshold

# Compare the results of several thresholding methods side by side
fig, ax = try_all_threshold(camera(), figsize=(10, 8), verbose=False)

png

Resources:

[1] E. A. Zanaty, Medical Image Segmentation Techniques: An Overview, Taif University, Saudi Arabia, 2016.

[2] Stéfan van der Walt, Johannes L. Schönberger, Juan Nunez-Iglesias, François Boulogne, Joshua D. Warner, Neil Yager, Emmanuelle Gouillart, Tony Yu and the scikit-image contributors. scikit-image: Image processing in Python. PeerJ 2:e453 (2014)

Tight Words Problem

Given an alphabet {0,1,…,k} with \(0 \leq k \leq 9\), we define a word of length \(n \geq 1\) over this alphabet to be tight if the difference between any pair of adjacent digits in the word is no greater than 1.

For this problem we will use dynamic programming: we build a table from the bottom up, where we can look up previously computed answers.

Strategy:

We will keep track of the ending letter, the length of the word, and the alphabet chosen. Let us define the following function:

\[\#TW(N,K,END)\]

as the number of tight words of length N, over the alphabet {0,1,…,K} (K being the last letter), that end in the letter END.

Preparation and Base Cases

There are some base cases that we need to look at first, in order to build our solution.

When the alphabet only has the one letter “0”:

    for n in range(1, N + 1):
        dp[(n, 0, 0)] = 1

This is because, for each length, there is only one such word, produced by \(0^*\).

Words of length 1

    for end in range(0, 10):
        for k in range(0, 10):
            dp[(1, k, end)] = 1 if end <= k else 0

The main loop:

Now we just have to work through the remaining cases:

Let us start with every word that ends in 0. A word of length n ending in 0 can only be obtained by appending a 0 to a word of length n − 1 that ends in 0 or in 1. So we first take the count of words ending in 0, and if k is greater than or equal to 1, we also add the count of words ending in 1.

    for n in range(2, N + 1):
        for k in range(0, K + 1):
            dp[(n, k, 0)] = dp[(n - 1, k, 0)]
            if 1 <= k:
                dp[(n, k, 0)] += dp[(n - 1, k, 1)]

The following runs inside the same loop.

Now we have to consider the other endings. A word of length n ending in a letter strictly inside the alphabet can be formed by appending that letter to a word ending in the same letter or in the previous one, and if we are not at the upper limit, also to a word ending in the next letter.

            for end in range(1, k + 1):
                dp[(n, k, end)] = dp[(n - 1, k, end)] + dp[(n - 1, k, end - 1)]
                if end + 1 <= k:
                    dp[(n, k, end)] += dp[(n - 1, k, end + 1)]

At the end, we just need to consider the last extreme case: when the alphabet goes all the way up to the last character, 9. In this case, similar to the first one, the new letter can only follow the same letter or the previous one, because there are no letters beyond it.

            if k == 9:
                dp[(n, k, 9)] = dp[(n - 1, k, 9)] + dp[(n - 1, k, 8)]


Finalization:

At this stage we just go over the different endings that our alphabet allows and sum all of them up:


    total = sum(dp[(N, K, end)] for end in range(0, K + 1))
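
Putting all the pieces together, a runnable version of the table build-up might look like this (note that the k == 9 extreme case is already covered by the end + 1 <= k guard in the main loop, so it is omitted here):

def count_tight_words(N, K):
    dp = {}
    # One-letter alphabet {0}: a single tight word of each length.
    for n in range(1, N + 1):
        dp[(n, 0, 0)] = 1
    # Base case: words of length 1.
    for end in range(0, 10):
        for k in range(0, 10):
            dp[(1, k, end)] = 1 if end <= k else 0
    # Main loop over lengths and alphabets.
    for n in range(2, N + 1):
        for k in range(0, K + 1):
            dp[(n, k, 0)] = dp[(n - 1, k, 0)]
            if 1 <= k:
                dp[(n, k, 0)] += dp[(n - 1, k, 1)]
            for end in range(1, k + 1):
                dp[(n, k, end)] = dp[(n - 1, k, end)] + dp[(n - 1, k, end - 1)]
                if end + 1 <= k:
                    dp[(n, k, end)] += dp[(n - 1, k, end + 1)]
    # Finalization: sum over all possible endings.
    return sum(dp[(N, K, end)] for end in range(0, K + 1))

For example, count_tight_words(2, 1) returns 4, counting the tight words 00, 01, 10, and 11.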

A training course designed by Iván Felipe Rodríguez for the Colegio Liceo Hermano Miguel La Salle.

Jupyter

Jupyter is an interactive notebook. This means that with it you can write your papers, presentations, and assessments while also writing code. This will let you present your work in a more elegant, understandable, and interactive way. With Jupyter you can use R, Python, Julia, and Matlab, among many others.

Installation.

  1. Go to the page: https://www.continuum.io/downloads

  2. Download and install Anaconda. Versions are available for Windows, Mac, and Linux.

  3. We are ready! Now look for the Anaconda cmd among your programs and, once it opens, type the command:

    jupyter notebook

Something similar to this image will appear:

  4. In your preferred browser (Google Chrome, for example), a window with Jupyter will open.

To learn more about Jupyter, visit the following link with relevant information:

http://jupyter.readthedocs.io/en/latest/install.html

Text Editor.

Jupyter uses Markdown as its text-editing language. This makes it possible to write scientific reports quickly, including mathematical equations of any complexity, images, tables, and code snippets, providing a friendly environment for writing scientific literature.

What follows is an extremely quick guide to writing simple texts in Jupyter.

Titles:

Titles are written using “#” at the beginning, followed by a space and the title you want to set. For example:

# Title 1

## Title 2

### Title 3

#### Title 4

Inserting Images:

To insert images, you can use the following command:

    ![](path-to-the-image.extension)

For example, if I want to add the Jupyter logo image that I have in the Img folder, under the name jupyter.png, I should do the following:

Inserting an image

However, if you want to resize the image, the following command should be used:

    <img src="./images/image.extension" width=60 height=60>

Where width is the width of the figure and height is its height.

For example, for the same image we added previously, but smaller, we would use:

<img src="./images/jupyter.png" width=200 height=200>

The same principle works with videos and GIFs, like the ones I added while presenting the examples.

Equations:

Equations in Markdown are rendered with LaTeX. To insert one, it is enough to add a dollar sign at the beginning and at the end of the equation (`$equation_on_same_line$`), or two dollar signs if we want a centered equation (`$$Centered Equation$$`):

$This \times Equation = \frac{Is}{On \cdot the \cdot same \cdot line}$ as the text we want to include.

Whereas:

\[This \times Equation = \frac{Is}{Centered \cdot on \cdot the \cdot next \cdot line}.\]

For example, some complicated equations can be written as follows:

    $$
    \mathbf{V}_1 \times \mathbf{V}_2 =      \begin{vmatrix}
    \mathbf{i} & \mathbf{j} & \mathbf{k} \\
    \frac{\partial X}{\partial u} &     \frac{\partial Y}{\partial u} & 0 \\
    \frac{\partial X}{\partial v} &     \frac{\partial Y}{\partial v} & 0
    \end{vmatrix}
    $$

Which would result in the equation:

\[\mathbf{V}_1 \times \mathbf{V}_2 = \begin{vmatrix} \mathbf{i} & \mathbf{j} & \mathbf{k} \\ \frac{\partial X}{\partial u} & \frac{\partial Y}{\partial u} & 0 \\ \frac{\partial X}{\partial v} & \frac{\partial Y}{\partial v} & 0 \end{vmatrix}\]

Also:

$
\frac{1}{\Bigl(\sqrt{\phi \sqrt{5}}-\phi\Bigr) e^{\frac25 \pi}} = 1+\frac{e^{-2\pi}} {1+\frac{e^{-4\pi}} {1+\frac{e^{-6\pi}}{1+\frac{e^{-8\pi}} {1+\ldots} } } }
$

Result: \(\frac{1}{\Bigl(\sqrt{\phi \sqrt{5}}-\phi\Bigr) e^{\frac25 \pi}} = 1+\frac{e^{-2\pi}} {1+\frac{e^{-4\pi}} {1+\frac{e^{-6\pi}} {1+\frac{e^{-8\pi}} {1+\ldots} } } }\)

Perhaps I chose equations that look very scary, but I did so intentionally, just to show how far you can go writing equations that would be torturous in Word.

The following links contain well-known LaTeX commands and syntax that can help when you want to write your own equations:

https://wch.github.io/latexsheet/latexsheet-0.png

https://wch.github.io/latexsheet/latexsheet-1.png

Attention!:

For the following activity, please start Jupyter as indicated in the first part of the lesson. Then, find the folder where you keep this content and open the file “Empezando en jupyter.ipynb”.

Some Code:

Although this is not an introduction to programming in Julia, in this small tutorial I want to show examples of code that can be run alongside written content, which is unique to Jupyter.

The first step is to choose our kernel: remember that we can find kernels for almost any programming language (C, C++, Matlab, Julia, Python, R, among others).

In my case, I have kernels installed for Matlab, Julia, and Python. In the next installment, I will include instructions to make sure you have Julia and its kernel installed.

Simple Code:

  1. We all know that the number $\pi$ is 3.14159…. However, few know how to find that number. Numerically, there are different techniques for obtaining this result, for example Monte Carlo simulations.

    In this method, random points are placed in a square of side 1, and we count how many points lie at a distance of less than 0.5 from the center of the square. With enough points, the estimate converges to $\pi$.

    <img src="./images/montecarlo.png" width=190 height=190> The following code performs this task.

    n = 10000000  # Define the number of points
    
    count = 0  # Initialize the count variable, used to count the points.
    for i in 1:n
        u, v = rand(2)  # Draw two random numbers
        d = sqrt((u - 0.5)^2 + (v - 0.5)^2)  # Measure the distance to the
                                             # center of the square.
        if d < 0.5
            count += 1  # Count how many fall closer than 0.5
        end
    end
    
    area_estimate = 4*count / n  # Average to estimate pi
      3.1422252

To run this code, all you have to do is press Play at the top of the screen or use the ‘Shift+Enter’ shortcut.

Exercise:

Change n to different values and write down how the values of $\pi$ change.

| n       | Value of $\pi$ |
|---------|----------------|
| 1000    |                |
| 10000   |                |
| 100000  |                |
| 1000000 |                |
| 5000000 |                |
  2. Another interesting piece of code is the one used to plot simple functions, for example:

using PyPlot  # Load the library
x = linspace(0,2*pi,1000);  # Define the domain and
y = sin(x);                 # declare the function.
plot(x, y, color="blue",           # Define the color, the width,
linewidth=2.0, linestyle="--")     # and the line style.
title("The Sine Function")  # Add a title

png


Exercise:

Copy and paste the same code into the following three empty cells and plot the functions:

  • $4* cos(x)+3$
  • $43* x^3+x^2-2$
  • $4* x * cos(x)* tan(x)$

Follow-up Questions

  • What is Jupyter?
  • How do you start Jupyter on a computer?
  • What is a kernel, and how many are available for Jupyter?
  • How do you create a title, and what changes as you increase the number of “#” symbols before it?
  • How do you make numbered lists?

Appendix:

Links to download kernels:

  • Matlab: https://github.com/Calysto/matlab_kernel
  • Julia: https://github.com/JuliaLang/IJulia.jl
  • C: https://github.com/brendan-rius/jupyter-c-kernel
  • All available languages: https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages