In this article I would like to show one of the most basic ways to scrape data from a website. What this method lacks in elegance, it makes up for in efficiency. What we do is the following:
- Go to the website we want to scrape
- Use the inspector and console in Google Chrome's developer tools to analyze the site
- Save the data into a text file
Context
Brazil has an anticorruption program that audits local governments every month. The local governments are selected randomly through a monthly national contest. The goal of this project is to get metadata on each contest. In particular, we are interested in getting the contest number and date. The contest data is available on this site.
To get the data, we first open Chrome's Inspect tool to understand how the page stores it.
The dates we want sit in elements with the class "summary-view-icon". We go to the console and get all elements in the page with that class name.
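As a rough sketch of this step, the snippet below collects every element with a given class into a plain array. The helper name `elementsByClass` is hypothetical and not from the original walkthrough; `B` is the array name used in this article. It relies only on the standard DOM call `getElementsByClassName`, and the `typeof document` guard is there so the snippet does nothing outside a browser.

```javascript
// Hypothetical helper: copy all elements with `className` under `root`
// into a plain array. In the browser console, `root` is `document`.
function elementsByClass(root, className) {
  // getElementsByClassName returns a live HTMLCollection; Array.from
  // turns it into a regular array so map/filter/forEach are available.
  return Array.from(root.getElementsByClassName(className));
}

// Only runs where a DOM exists (for example, the browser console).
if (typeof document !== "undefined") {
  const B = elementsByClass(document, "summary-view-icon");
  console.log(B.length + " elements found");
}
```

Copying into a plain array is a small design choice: the live collection changes if the page changes, while the copy stays fixed while we work on it.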
All elements with class "summary-view-icon" are now stored in our array B. We look at the text inside each element of the array and see that, although B contains the dates, it also contains other data that we don't want.
We trim and replace that data, but this is not enough, because the array still contains entries other than dates. There are many ways to solve this problem and get only the dates. The easiest is to notice that the dates always follow the same format. In particular, the third character of each date is always "/". Therefore we can add a condition that checks whether the third character is "/". If so, we add the entry to a new array called dates.
We now have all dates in the dates array. Finally, I create a function to print it all in the console. Now I can copy the text and paste it into a CSV file. That is it!
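The filtering and printing steps can be sketched as follows. This is an illustration under assumptions, not the exact code from the original console session: the helper names `extractDates` and `toCsvColumn` are hypothetical, "trim and replace" is assumed to mean stripping surrounding whitespace and collapsing inner whitespace, and the sample strings are made up for the example.

```javascript
// Hypothetical helper: keep only the strings that look like dates.
function extractDates(texts) {
  const dates = [];
  for (const raw of texts) {
    // Assumed cleanup: strip surrounding whitespace, collapse inner runs.
    const text = raw.trim().replace(/\s+/g, " ");
    // The third character (index 2) of every date is "/".
    if (text.charAt(2) === "/") {
      dates.push(text);
    }
  }
  return dates;
}

// Hypothetical helper: one value per line under a header, ready to paste
// into a CSV file.
function toCsvColumn(header, values) {
  return [header, ...values].join("\n");
}

// Made-up sample strings standing in for the text of each element in B.
const sample = ["  12/03/2015 \n", "Sorteio 40", "\t05/10/2015"];
console.log(toCsvColumn("date", extractDates(sample)));
```

In the browser console, assuming B holds the collected elements, the same helpers would be called as `console.log(toCsvColumn("date", extractDates(B.map((el) => el.textContent))))`. Checking a single character is a deliberately loose test; a stricter pattern match would be safer on pages where other text could also have "/" in that position.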
Disclaimer: Note that this procedure is meant for rough data gathering or as a proof of concept.