A quick guide to getting data using JavaScript.


In this article I would like to show one of the most basic ways to scrape data from a website. What this method lacks in elegance, it makes up for in efficiency. The steps are the following:

  • Go to the website we want to scrape
  • Use Google Chrome’s inspector and console to analyze the site
  • Run a JavaScript function to get the data
  • Save the data into a text file

While this method is useful only for simple projects, it does not require the user to learn more complex tools like BeautifulSoup or Selenium. The code is posted here.

Context

Brazil has an anticorruption program that audits local governments every month. The local governments are selected randomly through a monthly national contest. The goal of this project is to get metadata on each contest. In particular, we are interested in getting the contest number and date. The contest data is available on this site.

To get the data, we first need to inspect the page with Google Chrome's Inspect tool to understand how the data is stored.
Milenko Fadic- Brazil Site.
The dates we want are located in elements with the class "summary-view-icon". We go to the console and get all elements on the page with that class name.
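As a minimal sketch of this step, assuming the standard DOM call `document.getElementsByClassName`, the lookup could be wrapped as below. The helper name `collectByClass` is hypothetical and not taken from the original code; it receives the document as a parameter so it can also be tried with a stand-in object outside the browser.

```javascript
// Hypothetical helper: collect every element that carries a given class name.
// `doc` is expected to behave like the browser's `document` object.
function collectByClass(doc, className) {
  // getElementsByClassName returns a live, array-like HTMLCollection;
  // Array.from copies it into a plain array that is easier to loop over.
  return Array.from(doc.getElementsByClassName(className));
}

// Assumed usage in the browser console:
//   var B = collectByClass(document, "summary-view-icon");
//   B.length; // number of matching elements on the page
```

Copying the collection into a plain array is a design choice for convenience: array methods such as `map` and `filter` are not available on the collection itself.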

All elements with class "summary-view-icon" are now stored in our array B. We look at the text inside each element of our array and get the following results:
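One way to look at the text of each element is sketched here. It assumes the text is read through the standard `textContent` property (the original may have used a different property), and the helper name `elementTexts` and the sample values are made up for illustration.

```javascript
// Hypothetical helper: return the text held by each element in a list.
function elementTexts(elements) {
  // textContent is the raw text of a node, including surrounding whitespace.
  return Array.from(elements, function (el) {
    return el.textContent;
  });
}

// Stand-in objects so the sketch also runs outside a browser.
// These values are invented; the real page content will differ.
var sampleElements = [
  { textContent: "\n  01/02/2015 \n" },
  { textContent: " Some other label " }
];
console.log(elementTexts(sampleElements));
```

In the browser console, the same function could be called on the collection of elements found by class name.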
We see that although the array B contains the dates, it also contains other data that we don't want. We trim the text and replace the unwanted characters, and get the following:
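The exact characters replaced in the original are not shown, so the following is only a sketch under the assumption that the unwanted parts are line breaks, tabs, and extra spaces. The helper name `cleanText` and the sample strings are hypothetical.

```javascript
// Hypothetical helper: normalize the whitespace in one piece of text.
function cleanText(raw) {
  // Replace each run of whitespace (line breaks, tabs, repeated spaces)
  // with a single space, then trim the two ends.
  return raw.replace(/\s+/g, " ").trim();
}

// Invented sample values standing in for the text read from the page.
var rawTexts = ["\n  01/02/2015 \n", "\tSome\n other label "];
var cleaned = rawTexts.map(cleanText);
console.log(cleaned); // ["01/02/2015", "Some other label"]
```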
We see that this is not enough because our array still contains entries that are not dates. There are many ways to solve this problem and get only the dates. The easiest is to notice that the dates always follow the same format. In particular, the third character of each date is always “/”. Therefore we can add a condition that checks whether the third character is "/". If so, we add the entry to a new array called dates.
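The condition described above can be sketched as follows. The function name `keepDates` and the sample values are hypothetical; only the rule itself (third character equals "/") and the array name `dates` come from the text.

```javascript
// Hypothetical helper: keep only the entries whose third character is "/".
function keepDates(texts) {
  var dates = [];
  for (var i = 0; i < texts.length; i++) {
    // Index 2 is the third character, e.g. the first "/" in "01/02/2015".
    if (texts[i].charAt(2) === "/") {
      dates.push(texts[i]);
    }
  }
  return dates;
}

// Invented sample values: two date-like entries and two unwanted ones.
var mixed = ["01/02/2015", "Some other label", "15/03/2016", "ab"];
console.log(keepDates(mixed)); // ["01/02/2015", "15/03/2016"]
```

Note that this check is deliberately loose: any text with "/" in the third position passes, so it works only as long as no other entry on the page happens to share that pattern.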
We now have all the dates in the dates array. Finally, I create a function to print them all to the console.
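A print function along these lines might look like the sketch below. The names `formatLines` and `printAll` are hypothetical, and printing one value per line is an assumption about the layout, chosen because it is easy to copy out of the console.

```javascript
// Hypothetical helper: join the items into one block of text, one item per line.
function formatLines(items) {
  return items.join("\n");
}

// Hypothetical helper: print every item to the console in a single call,
// so the whole output can be selected and copied at once.
function printAll(items) {
  console.log(formatLines(items));
}

// Invented sample values standing in for the dates array.
printAll(["01/02/2015", "15/03/2016"]);
```

Printing everything with one `console.log` call avoids the per-line source markers that the console attaches to separate calls, which keeps the copied text clean.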
Now I can copy the text and paste it into a CSV file. That is it!
Disclaimer: Note that this procedure is meant for rough data gathering or as a proof of concept.
