Jekyll2021-01-12T10:07:24+00:00/feed.xmlLuigi SelmiThe Hough Transform2021-01-12T00:00:00+00:002021-01-12T00:00:00+00:00/dip/hough_transform<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
inlineMath: [['$','$'], ['\\(','\\)']],
processEscapes: true
}
});
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>
<p>The <a href="https://en.wikipedia.org/wiki/Hough_transform">Hough transform</a> is used in digital image processing and computer vision to find geometrical shapes such as lines, circles or ellipses, which are common in images that contain man-made objects. It is applied after an image has been pre-processed by an edge detector to find the edges that reveal the borders of objects or regions inside it. In this post I will briefly introduce the theory behind the Hough transform, and then I will present two examples: one with images containing simple geometrical shapes, to better explain the idea, and one with an image containing man-made objects. A Jupyter notebook with the Python code used to implement the functions discussed in the post and to produce the pictures shown here is available on my <a href="https://github.com/luigiselmi/datascience/blob/master/python/imaging/hough_transform.ipynb">GitHub repository</a>.</p>
<h2 id="introduction">Introduction</h2>
<p>In digital image processing different filters are available that can detect edges in an image, namely regions in which the intensity value of a set of pixels along a certain direction changes steeply. These regions contain pixels that are members of real edges but also pixels that are due to noise or blurring. Often what we want to know, and what we want a computer to detect, is the shape, namely the analytical description of the object that has been revealed by the edge detector, such as the slope and intercept of a line. This further step, after the edge detection, is called edge linking and consists of connecting together the pixels that are real members of an edge of a certain shape, avoiding the pixels that have been included due to noise or blurring. The most common shape that can be found in pictures, especially those containing man-made objects, is a line. The problem can be solved exhaustively by testing every pair of pixels in the edge regions. However, the computational complexity of such an approach would be proportional to the square of the number of edge pixels. Another approach was suggested in 1962 by Paul Hough, while trying to automatically determine the trajectories of charged particles crossing a bubble chamber.</p>
<p><img src="../assets/mesons.png" alt="Bubble chamber" /></p>
<p>The Hough method was to transform each bubble point $(x_0, y_0)$, represented by a pixel in the image, and the set of all possible lines $y_0 = sx_0 + d$ passing through it, into a line in a parameter space $(s, d)$ whose variables are the slope $s$ and the intercept $d$ with the y axis. If two points in the image belong to the same line, their representations in the parameter space must intersect for a certain value of the slope $s$ and the intercept $d$. We can therefore solve the problem of finding a line that goes through a certain number of pixels in the image by solving the problem of finding the point in the parameter space where the lines that represent each pixel intersect. The more lines intersect at a specific point $(s_0, d_0)$ of the parameter space, the more pixels in the image belong to the same line with slope $s_0$ and intercept $d_0$. The point in the parameter space that lies at the intersection of a high number of lines represents the most “voted” line in the image. Since the slope and the intercept are unbounded for vertical or nearly vertical lines, a different transformation was introduced by <a href="https://www.cse.unr.edu/~bebis/CS474/Handouts/HoughTransformPaper.pdf">Duda and Hart</a> that uses as parameters the orientation angle $\theta$ and the distance $\rho$ from the origin of the coordinate system to represent the set of lines that can pass through a pixel. We can derive the normal form of a line by computing the slope and the intercept in the frame of reference that is commonly used for images, where the origin is in the upper left corner, with the y axis pointing downward and the x axis pointing to the right.</p>
<p><img src="../assets/hough_transform.png" alt="Hough Transform" /></p>
<p>From the diagram we can easily derive the expressions of the slope $s$ and the intercept $d$ for the equation of a line $y = sx + d$</p>
\[s = \frac{y_2 - y_1}{x_2 - x_1} = \frac{\cos(\theta)}{\sin(\theta)}\]
<p>and</p>
\[d = \frac{\rho}{\sin(\theta)}\]
<p>so that, substituting $s$ and $d$ into $y_0 = sx_0 + d$ and multiplying both sides by $\sin(\theta)$, we can represent the set of lines passing through a pixel at $(x_0, y_0)$ with the expression</p>
\[\rho = -x_0 \cos(\theta) + y_0 \sin(\theta)\]
<p>With this expression, called the Hesse normal form or simply the normal form, we can represent the set of lines that pass through a pixel at $(x_0, y_0)$ in the image by a sinusoidal function in the parameter space $(\theta, \rho)$. If two points belong to the same line in the image, their sinusoidal curves in the parameter space must intersect at a certain point $(\theta_0, \rho_0)$. As before, the more sinusoidal curves intersect at a point $(\theta_0, \rho_0)$ of the parameter space, the more edge pixels lie on the corresponding line in the image, and the stronger its claim to be a real line. We can count the number of sinusoidal curves that intersect at each point of the parameter space by dividing this space into a grid of cells whose width and height depend on the angular and spatial resolution of the image. For example, if we can distinguish two lines in the image that are rotated by one degree and two lines that are separated by one pixel, we can set the width of each cell in the parameter space $(\theta, \rho)$ to one degree and the height to one pixel. In Python we can use a two-dimensional array, called the accumulator matrix, to store the number of sinusoidal curves that pass through each cell. Once we have processed all the edge pixels, computed the corresponding Hough transforms and counted the votes for each cell in the accumulator matrix, we can select the cells with the highest number of votes, which correspond to straight lines in the image.
In the following section we will see examples of the application of the Hough transform to detect simple geometrical shapes made up of dotted lines.</p>
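<p>As a minimal sketch of the normal form, and not the code of the notebook, the function below evaluates $\rho(\theta)$ for one pixel. The name <code>hough_curve</code> and the two pixel coordinates are assumptions made for illustration. Two pixels on the horizontal line $y = 50$ should produce curves that meet at $\theta = 90$ degrees and $\rho = 50$.</p>

```python
import numpy as np

def hough_curve(x0, y0, thetas_deg):
    """Return rho(theta) = -x0*cos(theta) + y0*sin(theta), theta in degrees."""
    thetas = np.deg2rad(thetas_deg)
    return -x0 * np.cos(thetas) + y0 * np.sin(thetas)

thetas_deg = np.arange(0, 180)  # one sample per degree, 0 <= theta < 180

# Two pixels on the horizontal line y = 50 (x is the column, y is the row).
rho_a = hough_curve(10, 50, thetas_deg)
rho_b = hough_curve(80, 50, thetas_deg)

# The two curves meet where their difference is closest to zero.
j = int(np.argmin(np.abs(rho_a - rho_b)))
print(thetas_deg[j], round(float(rho_a[j]), 2))  # 90 50.0
```

<p>Sampling the curve once per degree anticipates the quantization of the parameter space: each sample is a candidate cell in which the pixel casts a vote.</p>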
<h2 id="images-with-geometrical-shapes">Images with geometrical shapes</h2>
<p>An image in Python is a 2D array in which the intensity value of each pixel is stored. We start by creating an image with shapes composed of lines to test the performance of the Hough algorithm. The first step is to compute the Hough transform, in normal form, for each pixel that belongs to a geometrical shape. The second step is to initialize the accumulator matrix $A$ and, for each pixel that belongs to a shape, increment each cell $A[i_{\rho}, j_{\theta}]$ in the accumulator that its Hough transform, represented by a sinusoidal curve, passes through. In other words, we store in the accumulator the trace of the Hough transform of every edge pixel in the image. The Hough transform returns the quantized values $j_{\theta}$ and $i_{\rho}$ for $\theta$ and $\rho$. We choose the quantization for the angle $\theta$ based on the accuracy with which the orientation of a line can be determined in the image. We assume the angle $\theta$ lies in the interval $0 \leq \theta \lt 180$ degrees so that each line corresponds to a single pair $(\theta, \rho)$. If the resolution of our image is good enough that we can distinguish lines whose orientations differ by at least one degree, we can set the increment to 1 degree, or $\frac{\pi}{180}$ radians.
In the same way, we can choose the quantization for the distance $\rho$ of a line from the origin. Given an image whose 2D array shape is (M,N), i.e. M rows and N columns, the distance from the origin of a line that crosses the image cannot be bigger than the length of the diagonal of the image, therefore $0 \leq |\rho| \lt \sqrt{M^2 + N^2}$. If the spatial resolution of our image is one pixel, we can set the increment for the distance to 1 pixel as well.
With this quantization we can represent any pixel in an image and the set of lines that pass through it, described by the parameters $\theta$ and $\rho$ in the parameter space, with the two integer values $j_{\theta}$ and $i_{\rho}$, which range between 0 and 180 degrees and between 0 and the length of the diagonal of the image, respectively. The two integer values are used as indexes of the cell $A[i_{\rho}, j_{\theta}]$ that contains the number of votes for the line in the image whose angle with the y axis is $\theta$ and whose distance from the origin is $\rho$.</p>
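<p>The quantization described above can be sketched as follows. The helper names <code>accumulator_shape</code> and <code>quantize</code> are hypothetical, the distance is assumed to be non-negative here, and one extra row is added so that rounding a distance close to the diagonal never falls outside the array.</p>

```python
import math

def accumulator_shape(M, N):
    """Rows index the distance (1 pixel per cell), columns the angle (1 degree per cell)."""
    n_rho = int(math.ceil(math.hypot(M, N))) + 1  # diagonal length, plus one row for rounding
    n_theta = 180                                   # 0 <= theta < 180 degrees
    return n_rho, n_theta

def quantize(theta_deg, rho):
    """Map a point (theta, rho) of the parameter space to the indexes (i_rho, j_theta)."""
    j_theta = int(theta_deg) % 180  # 180 degrees wraps to 0: same vertical line
    i_rho = int(round(rho))         # assumes rho >= 0
    return i_rho, j_theta

shape = accumulator_shape(200, 200)
cell = quantize(45.0, 70.71)
print(shape, cell)  # (284, 180) (71, 45)
```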
<h3 id="the-hough-curves">The Hough curves</h3>
<p>As an example, we plot the Hough sinusoidal curves of three aligned pixels, to see that they intersect in one point $(\theta_0, \rho_0)$ of the parameter space that corresponds to the angle $\theta_0$ between the line that passes through them and the y axis, and to the distance $\rho_0$ of the line from the origin.</p>
<p><img src="../assets/hough_lines.png" alt="Hough Curves" /></p>
<p>We can see from the plot that the three sinusoidal curves that represent the three pixels in the parameter space cross each other at 45 degrees and at a distance of approximately 70 pixels. We will see how to use the accumulator matrix to derive both values with the best accuracy possible. The picture can be seen as a snapshot of the accumulator matrix, after the Hough transforms of the three pixels have been determined and stored.</p>
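<p>The intersection can also be located numerically. The sketch below assumes three pixels on the line $y = x + 100$ (the coordinates actually used for the plot are not given in the post) and looks for the angle at which the three curves are closest to each other.</p>

```python
import numpy as np

thetas_deg = np.arange(0, 180)
thetas = np.deg2rad(thetas_deg)

# Three aligned pixels (x, y) on the line y = x + 100; assumed coordinates.
pixels = [(20, 120), (50, 150), (80, 180)]
curves = np.array([-x * np.cos(thetas) + y * np.sin(thetas) for x, y in pixels])

# The curves intersect where their spread across the three pixels is smallest.
spread = curves.max(axis=0) - curves.min(axis=0)
j0 = int(np.argmin(spread))
theta0 = int(thetas_deg[j0])
rho0 = float(curves[:, j0].mean())
print(theta0, round(rho0, 1))  # 45 70.7
```

<p>With these coordinates the result agrees with the values read from the plot: an angle of 45 degrees and a distance of about 70 pixels, that is $100 \sin(45°)$.</p>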
<h3 id="the-accumulator-matrix">The accumulator matrix</h3>
<p>As said before, in Python we can use a 2D array to store the traces of the curves computed for each edge pixel using the Hough transform. The value of the distance parameter $\rho$ can be negative for certain values of the pixel coordinates and of the angle $\theta$. Since a negative index in NumPy counts from the end of the array instead of addressing a cell with a negative distance, we take the absolute value of the distance. In this way we are able to store the votes for any point in the parameter space.</p>
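<p>A possible implementation of the voting step is sketched below, assuming a cell size of one degree by one pixel and the absolute value of the distance as row index. The function name <code>hough_accumulator</code> and the test image are assumptions, and the function in the notebook may differ.</p>

```python
import numpy as np

def hough_accumulator(image):
    """Vote in A[i_rho, j_theta] for every non-zero pixel of a 2D image."""
    M, N = image.shape
    n_rho = int(np.ceil(np.hypot(M, N))) + 1
    j_theta = np.arange(180)
    thetas = np.deg2rad(j_theta)
    cos_t, sin_t = np.cos(thetas), np.sin(thetas)
    acc = np.zeros((n_rho, 180), dtype=np.int64)
    rows, cols = np.nonzero(image)  # rows are y, columns are x
    for y, x in zip(rows, cols):
        rho = -x * cos_t + y * sin_t               # Hough curve of the pixel
        i_rho = np.rint(np.abs(rho)).astype(int)   # absolute value, as in the text
        acc[i_rho, j_theta] += 1                   # one vote per angle
    return acc

# Test image: a horizontal segment of 80 pixels on row 30.
image = np.zeros((100, 100), dtype=np.uint8)
image[30, 10:90] = 1
acc = hough_accumulator(image)
peak = np.unravel_index(acc.argmax(), acc.shape)
print(peak, acc[peak])  # cell (30, 90) with 80 votes
```

<p>Taking the absolute value merges the votes of the lines $(\theta, \rho)$ and $(\theta, -\rho)$ into the same cell. An alternative would be to offset the row index by the length of the diagonal so that both signs are kept.</p>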
<p>We create an image with a triangular shape and then we compute the Hough transform of each pixel that belongs to any of the three lines that form the triangle. We store the number of curves that pass through each accumulator cell $A[i_{\rho}, j_{\theta}]$ and, after all the edge pixels are processed, we plot the image and the corresponding accumulator matrix. We notice four points in the Hough transform diagram with the highest values, also called peaks: the point at 135 degrees, which has the highest number of votes, one at 90 degrees, which represents the horizontal line in the image, and two other points at 0 and 180 degrees that represent the same vertical line in the image. We can extract the peaks from the accumulator matrix by setting a minimum vote threshold and taking only the cells whose value lies above it.</p>
<p><img src="../assets/hough_triangle.png" alt="Triangle Hough Transform" /></p>
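<p>Extracting the peaks with a vote threshold takes only a few lines of NumPy. The sketch below uses a hypothetical function <code>find_peaks</code> and a hand-made accumulator with invented vote counts, only to show the selection step.</p>

```python
import numpy as np

def find_peaks(acc, min_votes):
    """Return (i_rho, j_theta, votes) for cells with more than min_votes, most voted first."""
    i_rho, j_theta = np.nonzero(acc > min_votes)
    votes = acc[i_rho, j_theta]
    order = np.argsort(-votes, kind="stable")
    return [(int(i_rho[k]), int(j_theta[k]), int(votes[k])) for k in order]

# Hand-made accumulator: two strong cells and one weak cell (illustrative values).
acc = np.zeros((142, 180), dtype=int)
acc[30, 90] = 80
acc[70, 135] = 120
acc[10, 20] = 5

peaks = find_peaks(acc, min_votes=50)
print(peaks)  # [(70, 135, 120), (30, 90, 80)]
```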
<p>Once we have the angle $\theta$ and the distance $\rho$ of the peaks in the accumulator matrix, corresponding to the most voted lines in our image, we can compute the respective slopes and intercepts from $s = \cos(\theta)/\sin(\theta)$ and $d = \rho/\sin(\theta)$.</p>
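<p>The conversion from a peak to a slope and an intercept follows from the two expressions $s = \cos(\theta)/\sin(\theta)$ and $d = \rho/\sin(\theta)$ derived in the introduction. In this sketch the function name <code>line_from_peak</code> is an assumption, and vertical lines, for which $\sin(\theta) = 0$, are reported as <code>None</code> because their slope is unbounded.</p>

```python
import math

def line_from_peak(theta_deg, rho):
    """Return (slope, intercept) of y = s*x + d, or None for a vertical line."""
    theta = math.radians(theta_deg)
    sin_t = math.sin(theta)
    if abs(sin_t) < 1e-9:
        return None  # theta = 0 (or 180) degrees: vertical line, unbounded slope
    return math.cos(theta) / sin_t, rho / sin_t

s, d = line_from_peak(45, 70.71)
print(round(s, 3), round(d, 1))   # 1.0 100.0
print(line_from_peak(0, 25))      # None
```

<p>If the accumulator stores the absolute value of the distance, the sign of $\rho$, and therefore the sign of the intercept, has to be recovered separately, for example by checking which of the two candidate lines actually passes through edge pixels.</p>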
<p>We show another example with an image that contains a slightly more complex figure with two geometrical shapes, the triangle we have already used and a square box. We create the image and compute the accumulator matrix. We plot the image and the detected lines, setting the minimum threshold for the cells in the accumulator matrix to 50 votes first and then to 200.</p>
<p><img src="../assets/shapes_transform.png" alt="Shapes Transform" /></p>
<h2 id="images-with-man-made-objects">Images with man-made objects</h2>
<p>Now that we have tested our implementation of the Hough transform with images containing geometrical shapes made up of dotted lines, we are ready to apply the algorithm to find lines in pictures containing man-made objects. When we use pictures of real objects, before looking for lines or other geometrical shapes, we have to detect the edges that reveal the borders of objects or regions in the image. This step was not necessary in the previous examples because the edges of the geometrical shapes were drawn precisely using the equation of a line. Borders separating man-made or natural objects can be found using a thresholding function or an edge detector. Once edges have been detected, the next step is to link their pixels into lines for which we can determine the slope and the intercept. We perform the linking step using the Hough transform. We can build a pipeline of functions to find lines in pictures. We add a thresholding step after the edge detector to separate the edges cleanly from the background. We also add a step that takes into account the quantization error of the accumulator matrix, because of which the Hough curves may not intersect exactly in one single cell but more likely in a cluster of neighboring cells. The complete steps that we will perform in the next example are the following:</p>
<ol>
<li>Apply the gradient-based edge detector to an image to get its edge map.</li>
<li>Apply a threshold to the edge map to obtain a binary representation.</li>
<li>Apply the Hough transform to the foreground pixels of the binary edge map to build the accumulator matrix.</li>
<li>Suppress the nonmaximal cells from the accumulator matrix to reduce the quantization error.</li>
<li>Set the minimum votes threshold to select the peaks in the accumulator matrix that correspond to straight lines in the image.</li>
<li>Compute the slopes and intercepts of the lines in the image corresponding to the peaks.</li>
<li>Plot the lines on the image.</li>
</ol>
<p>The quantization error can be addressed by suppressing from the accumulator matrix the nonmaximal cells, that is, the cells whose value is lower than that of at least one of their neighboring cells. A function is defined in the Jupyter notebook to implement the suppression of the nonmaximal cells.</p>
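<p>One way to write such a function with NumPy alone is sketched below. It is not the function of the notebook: the name <code>suppress_nonmaxima</code> and the <code>distance</code> parameter are assumptions, the votes are assumed to be non-negative, and the wrap-around between 0 and 180 degrees is ignored for simplicity.</p>

```python
import numpy as np

def suppress_nonmaxima(acc, distance=1):
    """Zero every cell that is smaller than some cell within `distance` of it."""
    padded = np.pad(acc, distance, mode="constant", constant_values=0)
    local_max = np.zeros_like(acc)
    size = 2 * distance + 1
    # Slide the accumulator over every offset of the neighborhood
    # and keep the element-wise maximum.
    for di in range(size):
        for dj in range(size):
            window = padded[di:di + acc.shape[0], dj:dj + acc.shape[1]]
            local_max = np.maximum(local_max, window)
    return np.where(acc == local_max, acc, 0)

# A small cluster of neighboring cells: only the largest one should survive.
acc = np.array([[0, 0, 0, 0],
                [0, 5, 7, 0],
                [0, 6, 3, 0],
                [0, 0, 0, 0]])
result = suppress_nonmaxima(acc)
print(result)
```

<p>In this example only the cell with 7 votes is kept, so a single peak represents the whole cluster when the vote threshold is applied afterwards.</p>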
<p>In the next example we use an image of an airport that contains two runways, among other structures. We compute the edge map of the image by applying a gradient filter, and then we create a binary version of the edge map by applying a thresholding function that enhances the separation between the edges and the background. From the binary edge map we can compute the accumulator matrix. We suppress the nonmaximal cells in the accumulator matrix, within a default distance of one cell from each cell, and finally we select the peaks in the parameter space whose number of votes is above a threshold. The slopes and intercepts corresponding to the peaks are used to plot the detected lines superimposed on the original image.</p>
<p><img src="../assets/runways_hough_transform.png" alt="Runways Hough Transform" /></p>
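<p>The first two steps of the pipeline, the gradient-based edge map and its binary version, can be sketched as follows. The post does not say which gradient filter was used for the airport image, so <code>np.gradient</code>, the function name <code>binary_edge_map</code>, the threshold value and the synthetic test image are all assumptions.</p>

```python
import numpy as np

def binary_edge_map(image, threshold):
    """Return a boolean map that is True where the gradient magnitude exceeds threshold."""
    gy, gx = np.gradient(image.astype(float))  # derivatives along rows (y) and columns (x)
    magnitude = np.hypot(gx, gy)
    return magnitude > threshold

# Synthetic image: dark upper half, bright lower half, i.e. one horizontal border.
image = np.zeros((20, 20))
image[10:, :] = 1.0
edges = binary_edge_map(image, threshold=0.25)
edge_rows = np.nonzero(edges.any(axis=1))[0].tolist()
print(edge_rows)  # [9, 10]
```

<p>The foreground pixels of the boolean map are the ones that vote in the accumulator matrix. A lower threshold keeps more pixels, including noise, while a higher one may break long edges into fragments.</p>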
<p>We can see from the last picture that the Hough transform is able to determine the main lines, with their slopes and intercepts, that correspond to the borders of the runways of the airport. We can also notice that other lines, visible in the binary image, have not been included in the set that resulted from our choice of the vote threshold and neighboring distance. This is mainly due to the fact that those lines are shorter or contain fewer edge pixels than the two runways. This bias towards longer lines can be addressed, for example, by dividing the image into smaller boxes and then applying the Hough transform to each of them, or by finding the pixels that delimit the lines in the binary image and then looking for the corresponding lines in the accumulator matrix.</p>
<h2 id="conclusion">Conclusion</h2>
<p>The Hough transform can be used to extract lines from images with a complexity cost that is linear with respect to the number of edge pixels. We have shown the basic steps that are required to implement the Hough transform for which some manual settings are required, such as the quantization of the parameter space, the votes threshold and the neighboring distance for the accumulator matrix.</p>
Air Quality Forecasts2020-12-08T00:00:00+00:002020-12-08T00:00:00+00:00/copernicus/air_qualty<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
inlineMath: [['$','$'], ['\\(','\\)']],
processEscapes: true
}
});
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>
<p>During a pandemic the air quality might not be seen as the most pressing issue but, according to the World Health Organization (<a href="https://www.who.int/health-topics/air-pollution#tab=tab_1">WHO</a>), air pollution kills an estimated 7 million people worldwide every year. As for the situation in Europe, the European Environment Agency (<a href="https://www.eea.europa.eu/themes/air/health-impacts-of-air-pollution">EEA</a>) reports that in 2018 nearly 379000 premature deaths were attributable to particulate matter (PM2.5) and 54000 to NO2. The WHO has established some <a href="https://www.who.int/news-room/fact-sheets/detail/ambient-(outdoor)-air-quality-and-health">guidelines</a> on the concentration of certain pollutants above which an impact on the health of the population is expected. For example, PM2.5 should not exceed an annual mean of 10 \(\mu g/m^3\) and NO2 should not exceed an annual mean value of 40 \(\mu g/m^3\) or a 1-hour mean of 200 \(\mu g/m^3\). Given the relevance of these pollutants for the population’s health, every country and region in Europe has built a network of ground stations to monitor them, like those managed by the Agenzia Regionale Protezione Ambientale del Lazio (<a href="https://qa.arpalazio.net/">ARPA</a>) for the Latium region in Italy. The data from the ground stations in Europe are collected by the EEA and assimilated by the European Centre for Medium-Range Weather Forecasts (ECMWF) and eight other institutions across Europe into their air quality models for analysis and forecasts. The numerical models are able to provide an estimate of the concentration of the pollutants in areas that are far from the ground stations with a spatial resolution of 10 km.
The ECMWF provides through its Copernicus Atmosphere Monitoring Service (CAMS) the air quality analysis and forecasts as the median of the <a href="https://confluence.ecmwf.int/display/CKB/CAMS+Regional%3A+European+air+quality+analysis+and+forecast+data+documentation">ensemble</a> of the models. Data about trace gases such as NOx, ozone and SO2 are now also available from satellites, e.g. Sentinel-5P through its sensor <a href="http://www.tropomi.eu/">TROPOMI</a> and MetOp through <a href="https://atmos.eoc.dlr.de/app/missions/gome2">GOME-2</a>, but in this post I want to show how the model forecasts from the CAMS service can be fetched by anyone and visualized in an animation on a web page.</p>
<h2 id="the-copernicus-atmosphere-monitoring-service">The Copernicus Atmosphere Monitoring Service</h2>
<p>The Copernicus Atmosphere Monitoring Service (<a href="https://atmosphere.copernicus.eu/">CAMS</a>) provides information related to air quality, atmospheric composition, greenhouse gases and solar irradiance. The datasets released by the CAMS service are the result of assimilation processes in which observations from satellites and ground stations are used to update and correct every hour the estimates computed by a numerical model of the atmosphere. The CAMS is operated by the European Centre for Medium-Range Weather Forecasts (<a href="https://www.ecmwf.int/">ECMWF</a>) on behalf of the European Commission.</p>
<h3 id="cams-european-air-quality-forecasts">CAMS European air quality forecasts</h3>
<p>The CAMS provides its datasets as open data, available to all for free, through a web page and a web service. For the <a href="https://ads.atmosphere.copernicus.eu/cdsapp#!/dataset/cams-europe-air-quality-forecasts">European air quality forecasts</a> a user can select, among other options:</p>
<ul>
<li>the variables, that is the physical parameters she is interested in</li>
<li>the model she wants to use (nine models are available, plus the ensemble)</li>
<li>the levels, or heights for which she wants the forecasts</li>
<li>the area of interest, delimited by north and south latitudes and west and east longitudes</li>
<li>the dates from which the forecasts should start</li>
<li>the lead time hours, that is the hours of the forecasts</li>
<li>the format of the data (CSV, GRIB, NetCDF)</li>
</ul>
<p>If you want to set up a service based on the data provided by the CAMS you will likely use the web service API. In order to use the API, you have to be registered with the Copernicus Atmosphere Data Store (<a href="https://ads.atmosphere.copernicus.eu/#!/home">CADS</a>) and follow these steps:</p>
<ol>
<li>login</li>
<li>copy your ADS API key into the .cdsapirc file in your home folder</li>
<li>install the cdsapi Python package</li>
</ol>
<p>For more information on how to register and set up the API key follow the <a href="https://ads.atmosphere.copernicus.eu/api-how-to">how-to instructions</a>. The NetCDF format is easy to use in Python with the xarray library. If you choose one single day as the start of the forecasts, the file returned by the service will contain the forecasts in a multidimensional array. The names of the dimensions (metadata) are standardized and a description is provided in the file. We can easily visualize the data for one forecast using Matplotlib. Here is an image of the concentration of \(NO_2\) at ground level where we can clearly see that the concentration will go well beyond the annual mean limit recommended by the WHO in most of the Po Valley.</p>
<p><img src="/assets/no2_forecast_example.png" alt="Forecast example" /></p>
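<p>As a minimal sketch of the request described above, the options of the web form can be collected in a dictionary and passed to the cdsapi client. The key names, their values and the bounding box below are assumptions to be checked against the request shown by the dataset web form; build_request and fetch_forecast are hypothetical helper names, not part of any library.</p>

```python
def build_request(date, lead_hours, area, variable="nitrogen_dioxide"):
    """Return a request dictionary for the CAMS European air quality forecasts.

    The keys mirror the options listed on the dataset web form; their exact
    names and accepted values are assumptions to verify against that form.
    """
    north, west, south, east = area
    return {
        "variable": variable,
        "model": "ensemble",
        "level": "0",                      # ground level
        "date": f"{date}/{date}",          # a single start day
        "type": "forecast",
        "time": "00:00",
        "leadtime_hour": [str(h) for h in lead_hours],
        "area": [north, west, south, east],
        "format": "netcdf",
    }


def fetch_forecast(request, target="forecast.nc"):
    """Download the forecast and open it with xarray (needs an ADS account)."""
    import cdsapi          # imported lazily: both packages are third-party
    import xarray as xr

    client = cdsapi.Client()   # reads the URL and the key from ~/.cdsapirc
    client.retrieve("cams-europe-air-quality-forecasts", request, target)
    return xr.open_dataset(target)


# Hypothetical bounding box (north, west, south, east) roughly covering Italy.
request = build_request("2020-12-08", range(0, 24), area=(47.0, 6.0, 36.0, 19.0))
print(request["leadtime_hour"][:3])
```

<p>Calling fetch_forecast(request) would then return the dataset whose dimensions and variables can be inspected before plotting; the download itself is kept in a separate function because it requires valid credentials and network access.</p>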
<h2 id="the-jupyter-notebooks">The Jupyter Notebooks</h2>
<p>If we have requested the forecasts at different lead time hours we might want to visualize each forecast as a frame in an animation. The details on how to create an animation with Matplotlib are a little bit more convoluted, but a Jupyter notebook is available on my GitHub repository with an <a href="https://nbviewer.jupyter.org/github/luigiselmi/datascience/blob/master/python/copernicus/air_quality_forecasts.ipynb">example of an animation</a> built on a sequence of hourly forecasts for 4 consecutive days. The Python code can be used to fetch and visualize other variables with a little work and a different selection of the request’s parameters.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I find the Jupyter notebooks a great tool to test ideas and publish the results of experiments. The Jupyter notebooks are used by EUMETSAT for their highly recommended <a href="https://training.eumetsat.int/">training courses</a>. <a href="https://www.wekeo.eu/">Wekeo</a>, one of the Copernicus Data and Information Access Services (DIAS), provides virtual machines with a Jupyter Lab instance that can be used to write notebooks without the need to download the data. While the Copernicus data are free the computing resources offered by the Copernicus DIAS are not but it might be the only way to build a service in case you need to consume data that is updated frequently and whose size can range from hundreds of MBytes to GBytes.</p>
Public-key cryptography and digital signature using OpenSSL2018-12-27T00:00:00+00:002018-12-27T00:00:00+00:00/security,/cryptography/2018/12/27/message-encryption-and-signature<p>The purpose of this post is to explain how to communicate privately over the Internet using public-key cryptography and how to digitally sign a document.</p>
<h2 id="introduction">Introduction</h2>
<p>Being able to communicate privately is a civil right and often a business need. Just as we would not allow anyone to eavesdrop on our conversations, we also have the right to avoid surveillance by companies or governments. There are many tools and protocols, many of them open source and free, that can be used to enhance the security of our communications over the Internet. The aim of this post is to provide a very high level description of the ideas behind these tools and protocols and practical guidance on how to use one of them, <a href="https://www.openssl.org/">OpenSSL</a>, which is open source, free and widely used to secure communications over the Internet. In particular, in this post we will show</p>
<ol>
<li>How to prevent eavesdropping while sending files to our friends or collaborators over the Internet</li>
<li>How to digitally sign a document</li>
</ol>
<p>It is assumed that you are using a Linux distribution or a Mac with OpenSSL version 1.0.2 installed. In case you use Windows you might want to install <a href="https://www.cygwin.com/">Cygwin</a> with openssl. You should also be familiar with the command line.</p>
<h2 id="alice-and-bob">Alice and Bob</h2>
<p>We will set up a context for the secure communication problem using two characters, Alice and Bob. We will simulate the transmission of encrypted messages between Alice and Bob by copying files from Alice’s folder to Bob’s and vice-versa on our local file system. This simulation is meant for you to easily check what happens on both sides when they send or receive messages using OpenSSL, but it must be kept in mind that it bypasses the core business of encryption that is about sending messages over an insecure channel such as the Internet where other parties could eavesdrop or interfere with Alice’s and Bob’s communication. With this warning in mind, let’s start our simulation by creating a folder for Alice’s messages and one for Bob’s</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">$ mkdir alice
$ mkdir bob</code></pre></figure>
<p>Let’s imagine that Bob can’t remember his bank account details and asks Alice to send them to him by email. Alice is aware that sending the data as plain text over the Internet is risky so she wonders how to send the data to Bob in such a way that nobody else but he can read and use the data. After some investigation, Alice decides that the solution to their problem is public-key cryptography and the OpenSSL tools.</p>
<h2 id="public-key-cryptography">Public-key cryptography</h2>
<p>Public-key cryptography consists of creating a key pair, namely a private key and a public key, to encrypt and decrypt messages. The private key is kept secret and is never shared with anyone. The public key, on the other hand, can be shared freely and even published on the Internet. Alice uses Bob’s public key to encrypt the messages being sent to him. Bob uses his private key to decrypt the messages encrypted with his public key. Only the owner of the private key can decrypt a message encrypted with his or her public key. There are different ways of creating a key pair but all are based on mathematical problems that are believed to be infeasible to solve in a reasonable time, such as factorizing a number that is the product of two big prime numbers. This problem is the basis of the Rivest-Shamir-Adleman (RSA) cryptosystem. The idea is to find two prime numbers big enough, e.g. with more than 150 digits each, so that computing their product is very easy while recovering them from the product would take even a cluster of computers far more than decades. In RSA, the public key consists of the product of the two prime numbers, called the modulus, and a public exponent, while the private key consists of a private exponent that can be computed only by someone who knows the two prime numbers. An eavesdropper who wants to decrypt a message would need to derive the private key from the public one, and the only known practical way to do this is to factorize the modulus, which is a hard problem when the prime numbers are big enough. Using RSA with keys of adequate size we can therefore be reasonably confident that nobody will be able to decrypt our messages. The algorithm used for the encryption is well known and publicly available. The only thing that is not public, and known only to the owner of the key pair, is the private key. Let’s see what Alice and Bob have to do to keep their communication private:</p>
<ol>
<li>Alice and Bob create their own private and public keys.</li>
<li>Bob sends Alice his public key.</li>
<li>Alice encrypts her message using Bob’s public key and sends it to Bob.</li>
<li>Bob decrypts Alice’s message using his private key.</li>
</ol>
<p>So, first of all, both Alice and Bob need a key pair.</p>
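<p>Before creating real keys with OpenSSL, the arithmetic behind an RSA key pair can be illustrated with a toy example. This is only a sketch: the two primes are tiny and chosen for readability, no padding is applied, and the function names are hypothetical. It requires Python 3.8 or later for the modular inverse computed with pow.</p>

```python
from math import gcd

# Two tiny primes, chosen only to keep the numbers readable.
p, q = 61, 53
n = p * q                      # modulus, part of the public key
phi = (p - 1) * (q - 1)        # computable only by someone who knows p and q

e = 17                         # public exponent, must be coprime with phi
assert gcd(e, phi) == 1
d = pow(e, -1, phi)            # private exponent: (e * d) % phi == 1


def encrypt(message, public_key):
    """Raise the message to the public exponent, modulo n."""
    exponent, modulus = public_key
    return pow(message, exponent, modulus)


def decrypt(ciphertext, private_key):
    """Raise the ciphertext to the private exponent, modulo n."""
    exponent, modulus = private_key
    return pow(ciphertext, exponent, modulus)


public_key, private_key = (e, n), (d, n)
ciphertext = encrypt(65, public_key)
print(ciphertext, decrypt(ciphertext, private_key))   # prints: 2790 65
```

<p>With such a small modulus anyone can factorize n = 3233 and recompute d, which is why real keys use primes with hundreds of digits.</p>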
<h3 id="1-alice-and-bob-create-their-own-private-and-public-keys">1. Alice and Bob create their own private and public keys.</h3>
<p>Alice doesn’t yet have a key pair, so she needs to create it. As an example she may use the RSA cryptosystem. Her private key will be stored in a file, e.g. alice_rsa. The size of the key will be 2048 bits. Let’s move into Alice’s folder and execute the command</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>openssl genpkey <span class="nt">-algorithm</span> RSA <span class="nt">-out</span> alice_rsa <span class="nt">-pkeyopt</span> rsa_keygen_bits:2048</code></pre></figure>
<p>The private key in alice_rsa is saved in the Privacy-Enhanced Mail (PEM) format and looks like the following</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">$ cat alice_rsa
-----BEGIN PRIVATE KEY-----
MIIEvgIBADANBgkqhkiG9w0BAQEFAASCBKgwggSkAgEAAoIBAQC68nDsjtWepLcM
pF4zVaMdFsVdg692M5Mj9v/vGvgyyPHpHmH/QKolOB9KtlUZcth6d7fwmFgyaa/m
XN1HjORKrpzm0rysPpFXJymUmIGy9XLzvPP4phJS3oGsnjsQJ6O017uWt8kqgz5U
U+hYc1/AOUCA9vEP1AN+6fT6O11zZQ75aeJOK6aESV/++7ZaM4M8maVrCKFhonBU
L9ByHNDkgQMWIu2iGazb7FZ5xDHWq+wBpfJZe8/3LNjS7VpUNeqoCsBuUE4fWHYo
TSPK3CmhpWHtk9EnMxyho/rpt6/ETYPOQ3QV6Uxz4G9tDLpJzgL2Q4VHVKwqJSnT
BLrvVOnxAgMBAAECggEBAK0VbDHIqMVh0Ux2HfU/U27KN182Xcx9Qbzpodm5yZQT
cc4Y4DhYoW8mP+qHV9DhAMaacwXhtr6uFTqePg1Rx8fRVNlswVxj7WKYkqnObT7I
e25pQiSzdYGeGsc8FIkHek0j870+WZTvwFSI/zRtVXh+SVddyqCR9c6aQ8MuFX6Q
u9LzGNcYTg6Dmv8qsrXlctkJRvLOuKajaAG3AHT6f5GTXRDDhk9/Ab4h9Dorkgen
Z0fg9yLfvlltO6z3z99VHIMlWX4TZ67kGP7L+AqPpzN9Qj9G15h2Blb6InlB5J/q
pgKUVGGVC0TTILLjXLUT3xEvrWvpqHIAj80Xejuz0XUCgYEA5CAIZdKyqTrfNKTg
B7ZxIX23R/YoqU6SGYDJ/8mz8Z+0PdlCrFb/fTvb2e8aQWSfYTccwhpPH9ZGQ2zL
hXImNxPeVPymVYojZ8XQwwsd2KoK1jkuLr7uzcs393P5PB0YoCzB4kvsnlozcyrm
+lTz913eHd6BsXoRx8GeRPs3nx8CgYEA0cpRr8+EBYgWCgJ71cpB8BLZ+kZErz5W
p9GO5CD/wqaJ+Ljzrvr+XmbCFzDaf/KPTcYeFD7bz9aYq3SSavvnNSQQsHhjuphb
CE40eR/fLwubyYhjOdHXjdYrxsI2gF7FyOO25PWx2OLoCqZITBYlaOdgxefQNtlM
1boATrYZJO8CgYBCuX/bUIqDZz3cJxGED//9HMlcGgsAooOnQ/1RfMzOMrlEkeSn
hfbKyZRfpUkXsXfQto8J0yorlMAOfqb0zFOTLpOMZi28vV/nvXt3YSwEsI/k4uq4
L46n0PX4wgo3ZAdM6mp3Z1+5XYbI+9Z9iBWn1+Pc9rUWlS7YL7C8WoKFXwKBgCmI
w7lp/TpXIf3jVf8SpxFPuiYpqUmErwVUoNSbj+dKr4A1pdEb0iaAc6bBvlCchjCg
q63YcA5q7xjq4F4b9z93H3LAswXrSgKP8SWV4Mrgonw462Q0HlfvcgVMyBuMJ95I
7xnPZuGIsuYA28lsjQWC4Y7tATUKuoKJ66ups7qzAoGBAJyKVY2ZqpkEHlzMixnk
BBKZA9sccokOYWtVtnCxWZYnnG7ElOBvojuLtf+/stvIadnCVe7km6f6J50QcqtH
1g6eTMfEoqkXG5plBlcEbjEv+wAGO9RXCiyYNquUuwjMrgv8dqUpHGXdw6XxxGi6
LTf0HIwHOpMNVVyptpRZoCH/
-----END PRIVATE KEY-----</code></pre></figure>
<p>The public key can be created from the private one, and saved in e.g. alice_rsa.pub, with the command</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>openssl rsa <span class="nt">-in</span> alice_rsa <span class="nt">-pubout</span> <span class="nt">-out</span> alice_rsa.pub</code></pre></figure>
<p>Alice’s public key will look like</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">$ cat alice_rsa.pub
-----BEGIN PUBLIC KEY-----
MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAuvJw7I7VnqS3DKReM1Wj
HRbFXYOvdjOTI/b/7xr4Msjx6R5h/0CqJTgfSrZVGXLYene38JhYMmmv5lzdR4zk
Sq6c5tK8rD6RVycplJiBsvVy87zz+KYSUt6BrJ47ECejtNe7lrfJKoM+VFPoWHNf
wDlAgPbxD9QDfun0+jtdc2UO+WniTiumhElf/vu2WjODPJmlawihYaJwVC/QchzQ
5IEDFiLtohms2+xWecQx1qvsAaXyWXvP9yzY0u1aVDXqqArAblBOH1h2KE0jytwp
oaVh7ZPRJzMcoaP66bevxE2DzkN0FelMc+BvbQy6Sc4C9kOFR1SsKiUp0wS671Tp
8QIDAQAB
-----END PUBLIC KEY-----</code></pre></figure>
<p>Now we have Alice’s key pair in her folder. Let’s do the same for Bob. We move into Bob’s folder and create his key pair, stored in e.g. bob_rsa and bob_rsa.pub, as we did for Alice. Once Alice and Bob have their key pairs we are done with the 1st step of the procedure.</p>
<h3 id="2-bob-sends-alice-his-public-key">2. Bob sends Alice his public key.</h3>
<p>Let’s move to the 2nd step: Bob must send his public key to Alice so that she will be able to send him her encrypted message. We simulate this by copying Bob’s public key file, bob_rsa.pub, into Alice’s folder. From Bob’s folder</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span><span class="nb">cp </span>bob_rsa.pub ../alice/</code></pre></figure>
<p>As soon as a copy of Bob’s public key is in Alice’s folder, the 2nd step of the procedure is complete and we can move to the 3rd: Alice will encrypt her message using Bob’s public key and will send it to Bob.</p>
<h3 id="3-alice-encrypts-her-message-using-bobs-public-key-and-sends-it-to-bob">3. Alice encrypts her message using Bob’s public key and sends it to Bob.</h3>
<p>Bob’s public key can now be used by Alice with OpenSSL to encrypt her message stored in a file, e.g. data.txt, containing sensitive information</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">Bob Bank Account
userid: 123456
password: 276f8%2<span class="o">=</span>0as<span class="o">}</span>
pin: 4657</code></pre></figure>
<p>In our example the size of the file is only 65 bytes. Alice encrypts the file using OpenSSL and Bob’s public key that she has received from him, e.g. by email, which we have simulated by simply copying the file from Bob’s folder to Alice’s. From Alice’s folder</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>openssl rsautl <span class="nt">-encrypt</span> <span class="nt">-pubin</span> <span class="nt">-inkey</span> bob_rsa.pub <span class="nt">-in</span> data.txt <span class="nt">-out</span> data.txt.enc</code></pre></figure>
<p>Now Alice can send her encrypted message, data.txt.enc. The encrypted message is a binary file whose content doesn’t make any sense and can be decrypted only by Bob using his private key. The padding that OpenSSL adds to the message before the RSA operation is randomized, so executing the same command again will result in a different ciphertext, but every ciphertext decrypts to exactly the same message. If Alice were a real person she would be able to send it to Bob by email. We will once again simulate the sending of the encrypted message by copying it into Bob’s folder. From Alice’s folder</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span><span class="nb">cp </span>data.txt.enc ../bob/</code></pre></figure>
<p>As soon as the encrypted message has been received by Bob, in our simulation when it has been copied in Bob’s folder, the 3rd step is complete. We can move to the 4th and last step.</p>
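<p>Before moving on, a note on why the same encryption command yields a different ciphertext each time. By default rsautl applies the PKCS#1 v1.5 padding, which fills the block with random non-zero bytes before the RSA operation. The following sketch builds such a padded block for a 2048-bit key; the function name is hypothetical and the RSA operation itself is left out.</p>

```python
import secrets


def pkcs1_v15_pad(message: bytes, key_bytes: int = 256) -> bytes:
    """Build the block 00 || 02 || PS || 00 || message of PKCS#1 v1.5.

    PS is a string of random non-zero bytes, at least 8 bytes long, so the
    message can be at most key_bytes - 11 bytes.
    """
    if len(message) > key_bytes - 11:
        raise ValueError("data too large for key size")
    padding_len = key_bytes - 3 - len(message)
    # randbelow(255) + 1 yields a value between 1 and 255, never zero
    padding = bytes(secrets.randbelow(255) + 1 for _ in range(padding_len))
    return b"\x00\x02" + padding + b"\x00" + message


message = b"pin: 4657"
first = pkcs1_v15_pad(message)
second = pkcs1_v15_pad(message)
print(len(first), first == second)
```

<p>Both blocks are 256 bytes long and end with the same message, but their random parts differ, so the resulting ciphertexts differ as well. The padding also explains why a message must be at least 11 bytes shorter than the key.</p>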
<h3 id="4-bob-decrypts-alices-message-using-his-private-key">4. Bob decrypts Alice’s message using his private key.</h3>
<p>From Bob’s folder</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>openssl rsautl <span class="nt">-decrypt</span> <span class="nt">-inkey</span> bob_rsa <span class="nt">-in</span> data.txt.enc <span class="nt">-out</span> data.txt</code></pre></figure>
<p>Bob can open the file data.txt containing the original message in plain text that Alice wanted to send to him. We can easily verify that Bob’s decrypted message and Alice’s original message are exactly the same. From the root folder</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>diff <span class="nt">-s</span> alice/data.txt bob/data.txt
Files alice/data.txt and bob/data.txt are identical</code></pre></figure>
<p>The procedure that Alice chose to send her message to Bob, without risking anyone else reading it, is complete. In this example Alice did not use her private or public key. In case Bob wanted to send her feedback, he could use Alice’s public key to encrypt his message, so that only she would be able to decrypt it, using her private key. Both Alice and Bob must keep their private keys in a very safe place. The private key we have just created for them can be used by anyone who has access to it. One way to protect the private key is to encrypt it using an algorithm, e.g. AES-256, with a password so that only the person who knows the password can decrypt the private key and use it. For example, Alice could have made her private key safer by creating it with the following command</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>openssl genpkey <span class="nt">-algorithm</span> RSA <span class="nt">-out</span> alice_rsa <span class="nt">-pkeyopt</span> rsa_keygen_bits:2048 <span class="nt">-aes-256-cbc</span> <span class="nt">-pass</span> pass:wT16pB9y</code></pre></figure>
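<p>where wT16pB9y would be Alice’s password. A password passed on the command line with pass: is visible in the shell history and in the process list, so this form is suitable only for experiments, and any character that has a special meaning for the shell must be quoted.</p>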
<h2 id="hybrid-cryptosystem">Hybrid cryptosystem</h2>
<p>Alice has successfully solved Bob’s problem. She has been able to send him his bank account details in a secure way. Now she wants to send Bob a file, e.g. a jpeg picture that she doesn’t want anyone else to see, and whose size is some KB. Let’s try to encrypt the image on behalf of Alice</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>openssl rsautl <span class="nt">-encrypt</span> <span class="nt">-pubin</span> <span class="nt">-inkey</span> bob_rsa.pub <span class="nt">-in</span> alice.jpg <span class="nt">-out</span> alice.jpg.enc</code></pre></figure>
<p>This time OpenSSL will raise an error</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">RSA operation error
4294956672:error:0406D06E:,rsa routines:RSA_padding_add_PKCS1_type_2:data too large for key size:rsa_pk1.c:174:</code></pre></figure>
<p>The problem is that RSA can directly encrypt only messages that are shorter than the key. Since Bob’s key is 2048 bits long, or 256 bytes, and the padding that OpenSSL adds by default (PKCS#1 v1.5) takes at least 11 bytes, his public key cannot be used to encrypt messages that are bigger than 245 bytes. The best option to solve this issue is to use a symmetric algorithm. A symmetric algorithm uses a single key, called a symmetric key, for both encryption and decryption. Once a message has been encrypted with the symmetric key, it can be sent together with the symmetric key, itself encrypted using the public key of the recipient, so that he or she will be able to decrypt the message. One more reason to use a symmetric algorithm to encrypt a message is that symmetric algorithms are typically orders of magnitude faster than asymmetric ones. The algorithms used in symmetric key encryption are different from those used in public-key encryption. The symmetric key algorithms use a key that is based on a pseudo-random value taken from a huge range of possible values. The key is shared only by the two communicating parties. The strength of the algorithm rests in the difficulty of finding the key within a huge key space. The way in which the symmetric key must be created depends on the cryptographic algorithm, also called cipher. One of the most robust ciphers is AES-256, which we have already used to encrypt Alice’s private key. OpenSSL derives the symmetric key, to be used with the AES-256 cipher, from a secret string, in short secret, that can be created and stored in a file. Alice defines a new protocol in which she will create the secret that she will use to encrypt her picture and that she will share with Bob. The system that she is going to use is called a hybrid cryptosystem because it uses public-key and symmetric cryptography together.</p>
<ol>
<li>Alice creates the secret.</li>
<li>Alice encrypts the data using the AES-256 cipher and the secret.</li>
<li>Alice encrypts the secret using Bob’s public key.</li>
<li>Alice sends the encrypted data and the encrypted secret to Bob.</li>
<li>Bob decrypts the secret using his private key.</li>
<li>Bob decrypts the data using the AES-256 cipher and the secret.</li>
</ol>
<p>Let’s implement these steps on behalf of Alice and Bob using OpenSSL.</p>
<h3 id="1-alice-creates-the-secret">1. Alice creates the secret.</h3>
<p>First, Alice creates a secret, e.g. a sequence of 32 random bytes, using a pseudo-random bytes generator provided by OpenSSL. The longer the sequence of random bytes, the more difficult it is for an eavesdropper to figure it out. It should never be so short that it could be found simply by brute force. For example, a sequence of only 2 bytes (16 bits) can be found in at most 2^16 = 65536 attempts. From Alice’s folder</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>openssl rand <span class="nt">-out</span> secret 32</code></pre></figure>
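<p>The effect of the secret’s length on a brute force search can be checked with a few lines of Python. The sketch below also creates a 32-byte secret encoded as a hexadecimal string; keyspace_size and make_secret are hypothetical helper names.</p>

```python
import secrets


def keyspace_size(num_bytes: int) -> int:
    """Number of distinct secrets of the given length in bytes."""
    return 2 ** (8 * num_bytes)


def make_secret(num_bytes: int = 32) -> str:
    """Return num_bytes random bytes encoded as a hexadecimal string."""
    return secrets.token_hex(num_bytes)


print(keyspace_size(2))                  # the 2-byte case: 65536
print(len(str(keyspace_size(32))))       # decimal digits of 2^256: 78
secret = make_secret()
print(len(secret))                       # 64 hexadecimal characters
```

<p>A text encoding of the secret can be a safer choice when it is later read with the file: form of the -pass option, since OpenSSL takes only the first line of the file as the password and raw random bytes may happen to contain a newline. The rand command offers the -hex and -base64 options for this purpose.</p>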
<h3 id="2-alice-encrypts-the-data-using-the-aes-256-cipher-and-the-secret">2. Alice encrypts the data using the AES-256 cipher and the secret.</h3>
<p>The command will encrypt the image</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>openssl enc <span class="nt">-e</span> <span class="nt">-aes-256-cbc</span> <span class="nt">-in</span> alice.jpg <span class="nt">-out</span> alice.jpg.enc <span class="nt">-pass</span> file:secret <span class="nt">-p</span></code></pre></figure>
<p>and will print the key created by OpenSSL from the secret</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">salt</span><span class="o">=</span>469950DBF6FA435A
<span class="nv">key</span><span class="o">=</span>E94C0C70A8BC662DB270C57B642C010910B65118C97DD37088E84F6DC3627225
iv <span class="o">=</span>B83AB9A6A80D67DFA6B3572EB850EE0D</code></pre></figure>
<p>AES-256 is a block cipher that encrypts a fixed block of 128 bits of the message at a time with a 256-bit key. The mode of operation used in the example is Cipher Block Chaining (CBC). The CBC operation mode is a scheme that allows the use of a block cipher to encrypt messages longer than the block size (16 bytes). The -p option prints the key, derived by OpenSSL from the secret, together with two other parameters, salt and iv. The salt is a random value generated by OpenSSL and appended to the secret in the key derivation to mitigate dictionary attacks, in which an attacker tries a list of likely secrets, in case the secret has not been chosen wisely. The iv parameter is the initialization vector, which is combined with the first block of the message before it is encrypted. It ensures that no information can be extracted by an attacker from messages that may start with some common header. A new random salt is generated every time the command is executed and, since the key and the iv are derived from the secret and the salt, all three parameters change at each run, even if the file and the secret are the same.</p>
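<p>The relation between secret, salt, key and iv can be sketched in Python. When no other key derivation option is given, the enc command hashes the secret followed by the salt and repeats the hash until there are enough bytes for the key and the iv. The digest is MD5 in OpenSSL 1.0.2 and SHA-256 in newer releases. The function below is an illustrative reimplementation under these assumptions, not OpenSSL’s code, and the secret is a made-up example.</p>

```python
import hashlib
import os


def evp_bytes_to_key(secret: bytes, salt: bytes, digest: str = "md5",
                     key_len: int = 32, iv_len: int = 16):
    """Derive key and iv from a secret and a salt with one hash iteration.

    Each block is hash(previous_block + secret + salt); blocks are
    concatenated until there is enough material for the key and the iv.
    """
    material, block = b"", b""
    while len(material) < key_len + iv_len:
        block = hashlib.new(digest, block + secret + salt).digest()
        material += block
    return material[:key_len], material[key_len:key_len + iv_len]


secret = b"an example secret"             # hypothetical, not the one in the post
salt_a, salt_b = os.urandom(8), os.urandom(8)

key_1, iv_1 = evp_bytes_to_key(secret, salt_a)
key_2, iv_2 = evp_bytes_to_key(secret, salt_a)   # same salt: same key and iv
key_3, iv_3 = evp_bytes_to_key(secret, salt_b)   # new salt: new key and iv
print(key_1 == key_2, key_1 == key_3)
```

<p>The salt is stored at the beginning of the encrypted file, which is how the recipient, who knows the secret, can derive the same key and iv for the decryption.</p>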
<h3 id="3-alice-encrypts-the-secret-using-bobs-public-key">3. Alice encrypts the secret using Bob’s public key.</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>openssl rsautl <span class="nt">-encrypt</span> <span class="nt">-pubin</span> <span class="nt">-inkey</span> bob_rsa.pub <span class="nt">-in</span> secret <span class="nt">-out</span> secret.enc</code></pre></figure>
<h3 id="4-alice-sends-the-encrypted-data-and-the-encrypted-secret-to-bob">4. Alice sends the encrypted data and the encrypted secret to Bob.</h3>
<p>We can simulate the sending of the encrypted data and secret by copying them from Alice’s folder to Bob’s.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span><span class="nb">cp </span>alice.jpg.enc secret.enc ../bob</code></pre></figure>
<h3 id="5-bob-decrypts-the-secret-using-his-private-key">5. Bob decrypts the secret using his private key.</h3>
<p>From Bob’s folder</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>openssl rsautl <span class="nt">-decrypt</span> <span class="nt">-inkey</span> bob_rsa <span class="nt">-in</span> secret.enc <span class="nt">-out</span> secret</code></pre></figure>
<h3 id="6-bob-decrypts-the-data-using-the-aes-256-cipher-and-the-secret">6. Bob decrypts the data using the AES-256 cipher and the secret.</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>openssl enc <span class="nt">-d</span> <span class="nt">-aes-256-cbc</span> <span class="nt">-in</span> alice.jpg.enc <span class="nt">-out</span> alice.jpg <span class="nt">-pass</span> file:secret <span class="nt">-p</span>
<span class="nv">salt</span><span class="o">=</span>469950DBF6FA435A
<span class="nv">key</span><span class="o">=</span>E94C0C70A8BC662DB270C57B642C010910B65118C97DD37088E84F6DC3627225
iv <span class="o">=</span>B83AB9A6A80D67DFA6B3572EB850EE0D</code></pre></figure>
<p>You can verify that the image in Bob’s folder is exactly the same as the image in Alice’s folder by looking at them or by using the following command from the root folder</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>diff <span class="nt">-s</span> alice/alice.jpg bob/alice.jpg
Files alice/alice.jpg and bob/alice.jpg are identical</code></pre></figure>
<p>It can also be verified that the key derived by OpenSSL from the secret for the decryption is the same as the key derived for the encryption. If an invalid secret is used, the decryption will fail.<br />
This 2nd protocol enables Alice and Bob to send each other encrypted files of any size allowed by the channel. Unfortunately it is subject to the man-in-the-middle attack. A message sent over the Internet goes through several routers, where a 3rd party, called Mallory in cryptography, can intercept the exchange of public keys and impersonate both Alice and Bob by sending each of them his own public key instead of Bob’s and Alice’s respectively. Alice and Bob can solve this issue by publishing their public keys on a trusted website or by using certificates in which their public keys are signed by a trusted 3rd party. The creation of certificates, although possible with OpenSSL, requires the definition of a certificate authority and is beyond the scope of this post.</p>
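<p>The six steps of the hybrid protocol can be put together in one script. This is only a sketch under a few assumptions: the two folders are created in a temporary directory, a small file of random bytes stands in for alice.jpg, and pkeyutl is used in place of rsautl, which newer OpenSSL releases mark as deprecated. The secret is written as a single line of hexadecimal characters because the file: form of -pass reads only the first line of the file, so a secret made of raw random bytes could be cut short at a newline byte.</p>

```bash
# Scratch folders standing in for Alice's and Bob's (hypothetical layout).
work=$(mktemp -d)
mkdir "$work/alice" "$work/bob"

# Bob's key pair; his public key is "sent" to Alice by writing it into her folder.
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 -out "$work/bob/bob_rsa" 2>/dev/null
openssl rsa -in "$work/bob/bob_rsa" -pubout -out "$work/alice/bob_rsa.pub" 2>/dev/null

# 4096 random bytes play the role of the picture.
openssl rand -out "$work/alice/alice.jpg" 4096

cd "$work/alice"
# 1. Alice creates the secret: 32 random bytes written as one line of 64 hex characters.
openssl rand -hex 32 > secret
# 2. Alice encrypts the data with AES-256-CBC and the secret.
openssl enc -e -aes-256-cbc -in alice.jpg -out alice.jpg.enc -pass file:secret 2>/dev/null
# 3. Alice encrypts the secret with Bob's public key.
openssl pkeyutl -encrypt -pubin -inkey bob_rsa.pub -in secret -out secret.enc
# 4. Alice "sends" the encrypted data and the encrypted secret.
cp alice.jpg.enc secret.enc ../bob/

cd "$work/bob"
# 5. Bob decrypts the secret with his private key.
openssl pkeyutl -decrypt -inkey bob_rsa -in secret.enc -out secret
# 6. Bob decrypts the data with AES-256-CBC and the secret.
openssl enc -d -aes-256-cbc -in alice.jpg.enc -out alice.jpg -pass file:secret 2>/dev/null

# Compare Bob's decrypted file with Alice's original.
cmp -s "$work/alice/alice.jpg" "$work/bob/alice.jpg" && echo "round trip OK"
```

<p>Nothing in the script authenticates Bob’s public key, so it is exposed to the same man-in-the-middle attack as the manual procedure.</p>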
<h2 id="message-integrity">Message integrity</h2>
<p>Encrypting a message is almost never enough. We usually also need to be sure that nobody can tamper with our communications by intercepting our messages, dropping some of them or modifying them, even when they are encrypted, without being detected. We need message integrity. The integrity of a message can be provided by computing a value with a hash function, which takes a message as input and outputs a short string, the digest or fingerprint. If the input message is modified, even slightly, the hash function will output a completely different value in an unpredictable way. Even though the message space is much larger than the digest space, the chances of a collision, in which a hash function outputs the same digest for two different messages, are practically negligible. Hash functions are used to support integrity in protocols such as TLS, SSH and IPsec. OpenSSL provides many hash functions such as SHA256, a standard function that hashes long messages into 256-bit digests. In some common use cases, encryption is not needed at all while integrity can be a strong security requirement. As an example, the integrity of a software package downloaded from the Internet can be checked by comparing its fingerprint, provided on the website, with the fingerprint computed locally after the package has been downloaded and before it is executed. If the two fingerprints are the same, and the fingerprint on the website can be trusted, we are assured that the software package has not been modified by an attacker during the transfer. Hash functions are also used to compute the fingerprint of public keys. For example, Bob can provide his public key’s fingerprint to Alice so that it will be easier for her to verify whether her copy of Bob’s public key is the right one. From Bob’s folder</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>openssl dgst <span class="nt">-sha256</span> <span class="nt">-hex</span> <span class="nt">-c</span> bob_rsa.pub <span class="o">></span> bob.fingerprint</code></pre></figure>
<p>The fingerprint can be verified more easily than the full public key</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span><span class="nb">cat </span>bob.fingerprint
SHA256<span class="o">(</span>bob_rsa.pub<span class="o">)=</span> 7f:98:0e:4f:a7:e4:5d:5f:bb:fb:f5:80:3a:32:b8:7e:2a:23:22:44:c4:da:8c:4d:eb:95:fa:f8:9c:5f:d9:24 </code></pre></figure>
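<p>The comparison that Alice has to make can be sketched as follows. The script is self-contained and therefore generates a throwaway key pair for Bob in a temporary directory; the file names received.pub and tampered.pub and the fingerprint helper are made up for the example. Only the part of the output after the equals sign is compared, on the assumption that the label printed before it may differ between OpenSSL versions.</p>

```bash
work=$(mktemp -d)
cd "$work"

# Throwaway key pair standing in for Bob's.
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 -out bob_rsa 2>/dev/null
openssl rsa -in bob_rsa -pubout -out bob_rsa.pub 2>/dev/null

# Print only the colon-separated digest, dropping the "...(file)= " label.
fingerprint() {
    openssl dgst -sha256 -hex -c "$1" | sed 's/^.*= //'
}

published=$(fingerprint bob_rsa.pub)   # the value Bob makes available

cp bob_rsa.pub received.pub            # a faithful copy received by Alice
received=$(fingerprint received.pub)

cp bob_rsa.pub tampered.pub            # a copy altered in transit:
printf 'x' >> tampered.pub             # one extra byte appended
tampered=$(fingerprint tampered.pub)

[ "$published" = "$received" ] && echo "received.pub matches the published fingerprint"
[ "$published" != "$tampered" ] && echo "tampered.pub does not match the published fingerprint"
```

<p>A single appended byte is enough to change the fingerprint, which is what makes the short digest a practical substitute for comparing the whole key.</p>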
<p>In the following section we will address another important use case in which a hash function is used to digitally sign a file.</p>
<h2 id="digital-signature">Digital signature</h2>
<p>Alice is a journalist and wants to send Bob an article, e.g. a pdf file, being sure that no one else can claim to be the author. Once again she comes up with a protocol that can solve her problem.</p>
<ol>
<li>Alice creates a one-way hash of a document, Alice’s digest.</li>
<li>Alice encrypts the digest with her private key, thereby signing the document.</li>
<li>Alice sends the document, her public key and the signed digest to Bob.</li>
<li>Bob decrypts Alice’s digest with her public key.</li>
<li>Bob creates a one-way hash of the document that Alice has sent, Bob’s digest.</li>
<li>Bob compares his digest with Alice’s to find out if they match.</li>
</ol>
<p>If the signed hash matches the hash he generated, the signature is valid. Let’s say Alice wants to send a file, e.g. article.pdf, with her digital signature to Bob.</p>
<h3 id="1-alice-creates-a-one-way-hash-of-a-document-alices-digest">1. Alice creates a one-way hash of a document, Alice’s digest.</h3>
<p>Alice first chooses a hash function, e.g. SHA-256. She can create the one-way hash of the document, also known as the digest, with</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>openssl dgst <span class="nt">-sha256</span> article.pdf <span class="o">></span> alice.dgst</code></pre></figure>
<p>The content of the digest will be similar to</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">SHA256(article.pdf)= cb686d3838cba15e5e603b8fa5191759a46227230884e20325abd19fb997f064</code></pre></figure>
<h3 id="2-alice-encrypts-the-digest-with-her-private-key-thereby-signing-the-document">2. Alice encrypts the digest with her private key, thereby signing the document.</h3>
<p>The next step is to encrypt the digest produced by the hash function, alice.dgst, with her private key</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>openssl rsautl <span class="nt">-sign</span> <span class="nt">-inkey</span> alice_rsa <span class="nt">-keyform</span> PEM <span class="nt">-in</span> alice.dgst <span class="o">></span> alice.sign</code></pre></figure>
<h3 id="3-alice-sends-the-document-and-the-signed-digest-to-bob">3. Alice sends the document and the signed digest to Bob.</h3>
<p>Alice sends the document, article.pdf, together with her signature, alice.sign, and her public key, alice_rsa.pub, to Bob. Bob can verify Alice’s signature of the document using her public key. Again we will simulate the sending of the files by copying them from Alice’s folder to Bob’s.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span><span class="nb">cp </span>article.pdf alice.sign alice_rsa.pub ../bob</code></pre></figure>
<h3 id="4-bob-decrypts-alices-digest-with-her-public-key">4. Bob decrypts Alice’s digest with her public key.</h3>
<p>From Bob’s folder</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>openssl rsautl <span class="nt">-verify</span> <span class="nt">-inkey</span> alice_rsa.pub <span class="nt">-pubin</span> <span class="nt">-keyform</span> PEM <span class="nt">-in</span> alice.sign <span class="nt">-out</span> alice.dgst</code></pre></figure>
<p>The output, alice.dgst, is Alice’s digest of the document, extracted from her signature of the document.</p>
<h3 id="5-bob-creates-a-one-way-hash-of-the-document-that-alice-has-sent-bobs-digest">5. Bob creates a one-way hash of the document that Alice has sent, Bob’s digest.</h3>
<p>Bob can now compute the hash of the document article.pdf using the same hash function, SHA-256, that has been used by Alice</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>openssl dgst <span class="nt">-sha256</span> article.pdf <span class="o">></span> bob.dgst</code></pre></figure>
<h3 id="6-bob-compares-his-digest-with-alices-to-find-out-if-they-match">6. Bob compares his digest with Alice’s to find out if they match</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>diff <span class="nt">-s</span> alice.dgst bob.dgst</code></pre></figure>
<p>The result of the comparison is</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">Files alice.dgst and bob.dgst are identical</code></pre></figure>
<p>proving that the document has been signed with Alice’s private key. The signature cannot be repudiated and the document cannot be changed without compromising the validity of the signature.</p>
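<p>OpenSSL can also hash and sign in a single command, and hash and verify in another, through the -sign and -verify options of dgst. The sketch below is an alternative to the six manual steps, not the procedure followed above, and the signature it produces has a different format from alice.sign, so the two are not interchangeable. The key pair and the content of article.pdf are placeholders created in a temporary directory, and article.sig is a name chosen for the example.</p>

```bash
work=$(mktemp -d)
cd "$work"

# Throwaway key pair standing in for Alice's, and a placeholder document.
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 -out alice_rsa 2>/dev/null
openssl rsa -in alice_rsa -pubout -out alice_rsa.pub 2>/dev/null
printf 'Draft of the article\n' > article.pdf   # placeholder text, not a real PDF

# Alice: hash the document and sign the digest with her private key in one step.
openssl dgst -sha256 -sign alice_rsa -out article.sig article.pdf

# Bob: hash the document and check the signature with Alice's public key.
# The exit status of the command tells whether the signature is valid.
check() {
    if openssl dgst -sha256 -verify alice_rsa.pub -signature article.sig article.pdf >/dev/null 2>&1
    then echo valid
    else echo invalid
    fi
}

original=$(check)

# Any change to the document after signing invalidates the signature.
printf 'one more line\n' >> article.pdf
modified=$(check)

echo "original document: $original"
echo "modified document: $modified"
```

<p>The first check should report a valid signature and the second an invalid one, since the signature was computed on the digest of the unmodified document.</p>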
<h2 id="conclusion">Conclusion</h2>
<p>We have seen how to use OpenSSL to add some level of security to our communications with public-key cryptography and symmetric encryption. As previously cautioned, the protocols we have shown are not completely secure, but they will certainly limit the number of eavesdroppers capable of figuring out the content of your digital assets sent over the Internet. You can find more information on cryptography, on the algorithms and on how the protocols can be improved to enhance the security of communications by consulting the resources in the references.</p>
<h2 id="acknowledgments">Acknowledgments</h2>
<p>Thanks to Eurydice Prentoulis for proof-reading the text.</p>
<h2 id="references">References</h2>
<ol>
<li><a href="https://www.schneier.com/books/applied_cryptography/">Bruce Schneier - Applied Cryptography, 2nd Edition</a></li>
<li><a href="https://wstein.org/ent/">William Stein - Elementary Number Theory: Primes, Congruences, and Secrets</a></li>
<li><a href="https://crypto.stanford.edu/~dabo/cryptobook/">Dan Boneh, Victor Shoup - A Graduate Course in Applied Cryptography</a></li>
<li><a href="http://cacr.uwaterloo.ca/hac/">Alfred J. Menezes, Paul C. van Oorschot, Scott A. Vanstone - Handbook of Applied Cryptography</a></li>
<li><a href="https://www.coursera.org/learn/crypto">Dan Boneh - Cryptography I (Coursera)</a></li>
<li><a href="http://www.crypto-textbook.com/">Christoph Paar, Jan Pelzl - Understanding Cryptography</a></li>
</ol>