Experiences in the Use of Big Data for Official Statistics
Antonino Virgillito Istat
Think Big - Data Innovation in Latin America, Santiago, Chile, 6th March 2017
Introduction
The use of Big Data sources for the production of official statistics has been the subject of lively discussion within the statistical community in recent years, producing a significant body of study and work.
In particular, Istat has developed extensive experience in this area, with several ongoing initiatives.
We present some of these experiences and highlight results and lessons learned.
The Big Data Global Trend
Result of searching “Big Data” in Google Trends
The Path to Big Data in Official Statistics
Everyone started from ground zero; many lessons were learned together.
Several initiatives were organized at national and international level as the official statistics community got involved:
• UNECE Big Data Project (2014-2015): demonstrate the feasibility of production based on Big Data sources
• ESSnet Big Data project (2016-2018): integration of Big Data into the regular production of official statistics
Experiences in Istat
Web scraping
• Business statistics: scraping of enterprise web sites to derive information about enterprises
• Consumer prices: extraction of product prices from e-commerce web sites
Scanner data
• Production of CPI indexes
Web Scraping
Two approaches:
• Scrape the textual content of a large number of web sites and analyse it offline with text mining techniques to derive the information of interest (a minimal sketch follows below)
• Extract specific information from semi-structured web sites through custom software or automation tools (robots)
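As a rough illustration of the first approach, the sketch below bulk-collects the visible text of a page for later offline text mining. It is a minimal Python sketch assuming the requests and BeautifulSoup libraries; the URL is a hypothetical placeholder, not one used by Istat.

```python
# Minimal sketch of the first approach: bulk-scrape the textual content of
# enterprise web sites for later offline text mining. The URL below is a
# hypothetical placeholder.
import requests
from bs4 import BeautifulSoup

def scrape_text(url: str) -> str:
    """Fetch a page and return its visible text, stripped of markup."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Drop scripts and styles, which carry no natural-language content
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)

# Hypothetical usage; in practice the URL list would come from the
# URL inventory described later in the presentation.
# sites = ["https://www.example-enterprise.it"]
# corpus = {url: scrape_text(url) for url in sites}
```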
Web Scraping Enterprise Web Sites
General objective: to investigate whether web scraping, text mining and inference techniques can be used to collect, process and improve general information about enterprises.
Pipeline (from the slide diagram): crawling → scraping → indexing → searching, fed by enterprise web sites, the National Business Register and business statistics surveys (a minimal indexing sketch follows below).
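To make the indexing and searching stages concrete, here is a hedged sketch using a plain in-memory inverted index over already-scraped text; the corpus entries are illustrative placeholders, and a production system would presumably use a dedicated search platform instead.

```python
# Minimal sketch of the indexing and searching stages, assuming page text
# has already been scraped (e.g. with the previous sketch). The in-memory
# inverted index is for illustration only.
import re
from collections import defaultdict

corpus = {  # hypothetical scraped texts, keyed by URL
    "https://a.example.it": "online shop cart checkout",
    "https://b.example.it": "company profile and history",
}

index: dict[str, set[str]] = defaultdict(set)  # term -> URLs containing it
for url, text in corpus.items():
    for term in set(re.findall(r"[a-z]{3,}", text.lower())):
        index[term].add(url)

def search(term: str) -> set[str]:
    """Return the URLs whose scraped text contains the given term."""
    return index.get(term.lower(), set())

print(search("checkout"))  # -> {'https://a.example.it'}
```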
Using Scraped Data
• Use case 1: URLs inventory
• Use case 2: Web sales (e-commerce)
• Use case 3: Social media presence
• Use case 4: Job advertisements
Use Case 1: URLs Inventory
• Target population (ICT survey):
  – Enterprises with at least 10 employees
  – Not all of them have a web site and, for those that do, the URLs are not fully available
• The URL retrieval problem: given a set of identifiers (denomination, fiscal code, economic activity, etc.) for an enterprise X, search the web in order to
  – retrieve a set of candidate URLs
  – estimate which URL, if any, corresponds to the web site of X
URLs Inventory: Technique and Results
• Steps:
  – Query a search engine for the enterprise name
  – Crawl the returned pages and score them according to their content
  – Classify the results with a machine learning approach
• Machine learning step (see the sketch below):
  – A logistic model is fitted on a training set
  – The model is then applied to the unlabelled enterprises (those not belonging to the training set)
• About 105,000 URLs identified out of the roughly 130,000 websites pertaining to the enterprise population: 81% coverage
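A hedged sketch of the machine learning step, assuming each candidate URL has already been crawled and reduced to simple numeric features; the feature set and training data below are illustrative assumptions, not Istat's actual ones.

```python
# Hedged sketch of the URL inventory classification step: a logistic model
# fitted on labelled (enterprise, candidate URL) pairs, then applied to
# unlabelled candidates. Features and training rows are illustrative.
from sklearn.linear_model import LogisticRegression

# Each row: [enterprise name in page title?, fiscal code found on page?,
#            search result rank] for one candidate URL.
X_train = [[1, 1, 1], [1, 0, 2], [0, 0, 8], [0, 1, 3]]
y_train = [1, 1, 0, 0]  # 1 = URL is the enterprise's official web site

model = LogisticRegression()
model.fit(X_train, y_train)

# Score unlabelled candidates; keep the most probable URL per enterprise.
X_new = [[1, 0, 1], [0, 0, 9]]
print(model.predict_proba(X_new)[:, 1])  # probability of being the site
```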
Use Case 2: Web Sales (E-commerce)
ICT Survey
Predict whether or not an enterprise provides web sales facilities on its website (see the sketch below).
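A minimal sketch of this kind of binary text classification, assuming scraped page text is already available; the example uses a TF-IDF representation with a logistic model (one of the learners listed later), and the texts and labels are placeholders for ICT survey responses matched to the URL inventory.

```python
# Hedged sketch: predict e-commerce presence from scraped site text.
# Training texts and labels are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["add to cart checkout shipping", "company history contact us"]
labels = [1, 0]  # 1 = web sales facilities present

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["secure payment and cart"]))  # expected -> [1]
```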
Use Case 3: Social Media Presence
Information on the presence of enterprises on social media, mainly Twitter and Facebook (see the sketch below).
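As a minimal illustration of how such presence could be detected from scraped pages, the sketch below looks for links to the two platforms; the assumption that presence is signalled by outbound links, and the HTML snippet itself, are illustrative.

```python
# Minimal sketch: detect social media presence by looking for links to the
# main platforms in a scraped page. The HTML snippet is illustrative.
import re

SOCIAL_PATTERN = re.compile(
    r"https?://(?:www\.)?(twitter|facebook)\.com/[\w.\-]+", re.IGNORECASE
)

def social_platforms(html: str) -> set[str]:
    """Return the social platforms linked from a page's HTML."""
    return {m.group(1).lower() for m in SOCIAL_PATTERN.finditer(html)}

html = '<a href="https://twitter.com/istat_en">Follow us</a>'
print(social_platforms(html))  # -> {'twitter'}
```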
Use Case 4: Job Advertisements
Investigate how enterprises use their websites to handle job advertisements, and in particular whether or not they publish job advertisements.
Technique and Results
• Prediction carried out with different classification algorithms:
  – logistic model, classification trees, random forests, boosting, bagging, neural networks, Naive Bayes, SVM
• Algorithm performance evaluated according to different indicators (see the comparison sketch below)
• Quality of results still to be improved
  – The social media use case is almost ready for production
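A compact sketch of this kind of algorithm comparison, with synthetic data standing in for the real features; cross-validated accuracy is shown as one plausible choice among the evaluation indicators mentioned, not necessarily the ones Istat used.

```python
# Hedged sketch: compare several of the listed classifiers with
# cross-validation. Synthetic data stands in for the real features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(),
    "random forest": RandomForestClassifier(),
    "naive Bayes": GaussianNB(),
    "SVM": SVC(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```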
Web Scraping Prices
• Collecting data from the Internet by extracting structured content from web pages is an established technique for statistical data collection (see the parser sketch below)
  – Replaces repetitive centralized tasks
  – Opens the possibility of getting more data
• Price data are particularly attractive…
  – A lot of prices on the Internet!
  – Common practice in European NSIs
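A minimal sketch of a site-specific parser pulling structured price fields out of a product page; the URL and CSS selectors are hypothetical placeholders, and each real retailer site would need its own selectors.

```python
# Hedged sketch: a site-specific parser that extracts structured price data
# from a product page. Selectors and URL are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

def extract_price(url: str) -> dict:
    """Fetch a product page and return its name and displayed price."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    return {
        "product": soup.select_one("h1.product-title").get_text(strip=True),
        "price": soup.select_one("span.price").get_text(strip=True),
    }

# Hypothetical usage:
# print(extract_price("https://shop.example.com/item/123"))
```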
Web Scraping Prices at Istat
Consumer electronic products: collection of prices from 4 different e-commerce web sites, including Amazon.
(...) Transport sector: cost of tickets for trains and flights (Experimental)
17 types of products currently collected through scraping in production
Problems
• Sustainability
  – The more scraping solutions we develop, the more maintenance is required
  – Maintenance requires dedicated IT resources
• Scale
  – Scraping for prices is essentially a replacement for manual collection activity
  – Difficult to collect data at large scale
  – Data must be selected manually before collection
Web Scraping Prices Results and Next Steps
• Significant improvements in efficiency have been achieved so far through tools and techniques that are now mature and familiar
• The risk is that we are not able to reach the next level of scale and exploit the full potential of web data
• Are we ready to try new approaches?
Source: https://www.cepal.org/sites/de.../files/antonino_virgillito.pdf
Scraping data from the web | FAO DataLab
The Internet offers a wide range of facts and data sources, consisting of an enormous assortment of heterogeneous and poorly organized data. Web scraping involves fetching those data from web pages and extracting them into properly organized information. Web scraping is usually associated with the Big Data paradigm, given the variety of the data sources involved.
Source: https://www.fao.org/datalab/website/web/scraping-data-web
REPORT OF THE MEETING OF THE GROUP OF EXPERTS ON CONSUMER PRICE INDICES, 13TH SESSION
These experiments are starting to produce results in terms of compiling price indices based exclusively on scraped data.
(b) Experiments with different techniques for web scraping demonstrate two main approaches: i) using tools (robots) that reproduce and automate manual steps by collecting data from the Internet; ii) implementing a specific parser (a software programme) for each retailer web site that extracts structured price data from an unstructured web page.
(...) (d) Recording and classifying price observations correctly is a key common challenge identified by NSOs when using web scraping. Supervised machine learning methods are being experimented with for identifying the correct COICOP category for each scraped item from its generic text description found on the web site.
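As an illustration of that classification step (a hedged sketch, not the method of any particular NSO), the following assigns a COICOP category to a scraped item description with a simple supervised text classifier; the training descriptions and codes are placeholders.

```python
# Hedged sketch: supervised classification of scraped item descriptions
# into COICOP categories. Training examples and codes are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

descriptions = [
    "whole milk 1l bottle",
    "men's running shoes size 42",
    "smartphone 64gb black",
]
coicop = ["01.1.4", "03.2.1", "08.2.0"]  # illustrative COICOP codes

classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(descriptions, coicop)
print(classifier.predict(["semi-skimmed milk 0.5l"]))  # expected -> ['01.1.4']
```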
(...) Data seem to support the so-called 50% rule, by which half of a price increase is attributed to an increase in quality.
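For illustration (an example added here, not taken from the report): under this rule, if a replacement item costs 10% more than the item it replaces, 5% would be recorded as a pure price increase and 5% attributed to improved quality.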
(f) Web scraping and scanner data are two aspects of Big Data that can be combined within the regular production process of compiling the CPI.

Source: https://daccess-ods.un.org/acc...DS=ECE/CES/GE.22/2016/2&Lang=E
Filling gaps of statistical data | FAO DataLab
The Data Lab collects data at the national level (filling, where needed, gaps in the National Statistical Systems) and at the sub-national level (usually not collected by the FAO) to meet the need for more granular and more timely data in contexts where very little information is available, such as least developed countries, countries that lack territorial access to the sea, small island developing states, countries currently facing a food crisis, and highly populated countries.
The strategy for filling the data gaps consists mainly of the use of non-traditional sources, such as datasets, data catalogues on the web, and textual resources containing data. The methodology is characterised by a blend of big data solutions (such as web scraping, crowdsourcing, etc.) and text-mining techniques (extracting data from documents).
(...) Food loss and waste data from non-conventional sources : the Data Lab scrapes from the world wide web all the publications containing data and information on food losses and waste (reports, studies, articles from various sources), and then analyses the results and models data with specific statistical methods.
Source: https://www.fao.org/datalab/we.../filling-gaps-statistical-data
GE.14-08582 (E)
Big data should be dealt with, including web scraping. In favour of a CPI-TEG.
(l) Canada: Supports update. (...) Price collection by web scraping or Internet robots offers a new possibility to NSOs. (...) How can NSOs extend the use of web scraping? Some cautions were raised: changes in websites may cause problems for the compilation of the regular CPI.
Source: daccess-ods.un.org/acce...DS=ECE/CES/GE.22/2014/2&Lang=E
REPORT OF THE INTERSECRETARIAT WORKING GROUP ON PRICE STATISTICS: NOTE BY THE SECRETARY-GENERAL
At the meeting, which attracted 500 participants, new data sources (scanner data and web scraping), quality changes and quality adjustment methods, and meeting user needs were discussed. (...) Workshop on scanner data and web scraping
23. Eurostat, together with the scanner data task team of the Committee of Experts on Big Data and Data Science for Official Statistics, organized a virtual workshop on scanner data and web scraping from 12 to 14 October 2021. (...) The participants had the opportunity to present and discuss their latest work in the fields of scanner data, web scraping, classification and validation, as well as to participate in tutorials and demonstrations.
Source: https://daccess-ods.un.org/acc...?open&DS=E/CN.3/2022/36&Lang=E
IMPLEMENTATION OF THE UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE STATISTICAL PROGRAMME 2020 - ADDENDUM - REPORT OF THE REGIONAL WORKSHOP ON CONSUMER PRICE INDICES
On data collection methods, Norway presented internet purchases and web scraping. Using web scraping as a tool to collect price information from the internet has many benefits, such as collecting more prices in less time, and it can serve as an alternative to scanner data. However, there are many challenges in working with web scraping: for example, websites can change frequently, and it requires resources.
Norway also provided an introduction to, as well as training on, how to do web scraping using R software.
B. Session 2: Seasonal items and missing items
The session included presentations by Georgia, Ukraine, Kazakhstan and Norway.
Source: https://daccess-ods.un.org/acc...S=ECE/CES/2020/14/ADD.3&Lang=E
REPORT
It was found promising to see more countries doing research on scanner data and web scraping methods and applying these in practice. (...) Obtaining expenditure weights for web-scraped prices continues to be a challenge, and there is no obvious way of obtaining this information. (...) Countries may develop in-house software or buy it from a provider of web scraping software. Both ways have advantages and disadvantages that countries must consider.
Source: https://daccess-ods.un.org/acc...DS=ECE/CES/GE.22/2018/2&Lang=E
REPORT OF THE OTTAWA GROUP ON PRICE INDICES: NOTE BY THE SECRETARY-GENERAL
Alternative data such as web-scraped data, transaction data, big data and administrative data pose challenges to traditional index compilation procedures and methodologies. (...) New and innovative ideas discussed included compiling indices using big data, transaction data and web-scraped data. The full report of the meeting provides a summary of the key points that emerged from each session and feedback from the participants. (...) There will be a call for papers and discussions on:
(a) New data sources for the compilation of price indices (scanner and web-scraped data; quality adjustment);
(b) Compiling house price indices (residential and commercial);
(c) Challenging areas of measurement (such as services);
(d) Conceptual frameworks (index number formulae; multipurpose price statistics);
(e) Treatment of special cases (strongly seasonal products; zero prices).
Source: https://daccess-ods.un.org/acc...?open&DS=E/CN.3/2020/31&Lang=E
EFFECT OF COVID-19 ON PRICE AND EXPENDITURE STATISTICS: COVID-19 COULD AFFECT THE REAL SIZE OF ARAB ECONOMIES
Directly before the pandemic, ESCWA conducted comprehensive training sessions in two member States, namely Bahrain and Kuwait, on the use and application of web scraping for price data collection, allowing them to start direct application in their offices. (...) The training aimed to assess the feasibility of price data collection through web scraping in Arab countries, with a focus on certain categories of household consumption goods, such as fast-evolving technology items whose prices were to be scraped for the purposes of both the CPI and the ICP.
(...) ESCWA prepared two different web scraping templates for Qatar, corresponding to two different online outlets that cover a wide range of household goods and services (not only items related to fast-evolving technology), and provided detailed electronic instructions in a manual to the Qatari statistical office to help conduct price web scraping, given that no formal training had yet been conducted.
Source: https://daccess-ods.un.org/acc...CWA/CL4. SIT/2020/INF.1&Lang=E