Introduction to quantitative methods with R

Introductory session

Introductory Chapter

Note

  • Exercises associated with this chapter here

Scan what you need

Introduction slides

Website

About us

  • Two French instructors 🇫🇷
  • colleagues from INSEE: National Institute of Statistics and Economic Studies

All about us

Instructors

Nathan Randriamanana

Clara is coming tomorrow! 🔒 🤩 🥳

Who am I, then?

  • Data scientist at Insee
    • Insee statistician and civil servant
    • currently working at the Business Statistics Directorate
    • SIRENE business register

About my job

  • I wear two hats:
    • 🎩 Application Administrator: Acting as the functional lead for the SIRENE register. I bridge the gap between business needs and IT by writing functional specifications for system maintenance and evolution.
    • 🧢 Data Scientist: Managing the end-to-end ML workflow for APE classification, from model training and delivery to integration into the production application.

Example from a presentation on cloud technologies at the Cloud Native Days France 2026 conference

Socials

I may not post every day, but I’d love to stay in touch!

LinkedIn

GitHub

About my courses

You might also be interested in my other introductory courses:

Most importantly, check out the full training portal on the Insee datalab

Getting to know each other

  • Let’s introduce ourselves
  • What are you looking for in this course?
    • Any topics you are passionate about?
    • Any specific expectations?

Training roadmap

  • Day 1: Setting the stage with hands-on R and fundamental data manipulation
  • Day 2: Data cleaning and transformation 🧹 🪄
  • Day 3: Descriptive and inferential statistics 🔎🎲
  • Day 4: Data visualisation (ggplot2) 📊
  • Day 5: Reporting with Quarto and final project 👑

Why are we doing this?

  1. Educational goals: Upskilling and fostering a culture of technical autonomy.
  2. Strategic stakes: Ensuring institutional independence, cost-efficiency, and scientific transparency.

1. Educational goals

  • Gaining technical depth: Moving from “black-box” tools to understanding the underlying code and logic.
  • Developing a self-service culture: Building the autonomy to create, debug, and improve your own solutions.

  • The open data & open source ecosystems:
    • open data: Learning to leverage public datasets as a primary resource.
    • open source: Discovering community-driven tools (R, Quarto) to process them.
  • Hands-on introduction to R: Learning the fundamentals of R through practical application.
  • Introduction to reproducible publications with Quarto: Learning to create automated, transparent, and high-quality reports.

2. Strategic Stakes

  • Public fund stewardship & financial independence:
    • Optimizing national budget: Redirecting taxpayer money from recurring proprietary licenses toward internal expertise and innovation. 💸
    • Cost-efficiency: Scaling tools without proportional increases in software costs. 📈
  • Institutional Sovereignty:
    • Full ownership: Controlling our own statistical production chain without being tied to a vendor’s “black-box” roadmap. 🤔
    • Sustainability: Ensuring long-term access to our code and methods, independent of private sector pricing policies.
  • Scientific & Public Trust:
    • Reproducibility: Guaranteeing that official statistics are auditable and transparent (open science standards).
    • Modernization: Keeping the Institute at the state of the art in data science.

What you stand to gain

  • Understand the strategic value of open source in today’s landscape
  • Learn professional best practices 💎
  • Gain independence from proprietary solutions 🏢
  • Connect and collaborate with the global data community 🌍
  • Demystify the “industrialization” of data through practice
    • ➡️ Build higher technical expertise

Practical Information

  • A mix of slides and, most importantly, guided practice sessions.

  • Computing infrastructure (SSP Cloud) provided by Insee to avoid:

    • Installation headaches.
    • Configuration struggles.
  • Setup instructions to follow shortly.

We will be using SSP Cloud 😍🐉☁️🇫🇷!

(Quick walkthrough coming up later)

Additional resources

  • Community and French resources (worth a look)
    • utilitR – the most comprehensive R documentation 👶 (tip: great to explore using browser auto-translate).
    • Rzine – best practices and tutorials for social sciences and geography.

General overview

Data proliferation

  • Digitalization and technological innovation have slashed the cost of data production.
    • Exponential growth in the volume of data generated.
  • The use of statistics for governance is not new (cf. Desrosières or Ian Hacking)…
  • … but numbers now hold a central place in public debate and policy-making (Supiot, Davies).

Data diversification (1/4)

Classic tabular data

  • Structured data in table format (rows and columns). Source: Hadley Wickham, R for Data Science
  • R is exceptionally well equipped for this (handled via data frames).
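
To make the idea of a data frame concrete, here is a minimal sketch in base R. The table, its column names, and its values are invented for illustration only; they are not taken from any real dataset.

```r
# Hypothetical toy table: each row is a district, each column a variable.
# All names and figures below are made up for the example.
districts <- data.frame(
  district   = c("A", "B", "C"),
  population = c(120000, 45000, 80000),
  area_km2   = c(400, 900, 1600)
)

# Derive a new column from existing ones (inhabitants per square km).
districts$density <- districts$population / districts$area_km2

# Keep only the rows that satisfy a condition.
dense <- districts[districts$density > 50, ]

print(districts)
print(dense)
```

Each column is a vector of a single type, which is why operations such as the division above apply to a whole column at once.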

Data diversification (2/4)

Geospatial data

  • Tabular data with a spatial dimension.
    • Geography comes in multiple forms: points, lines, polygons…
  • R offers powerful tools for this type of data (as long as the data fits in memory).

Data diversification (3/4)

Textual and unstructured data

  • Long-standing statistical roots (Levenshtein distance, 1965; perceptron, 1957).
  • Rapid development since 2010:
    • Massive collection: social media, open-ended survey questions…
    • Lower storage costs and increased computing power.
    • New techniques: web scraping, Natural Language Processing (NLP), and LLMs.
  • Heavy usage across government, research, and the private sector.
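
As a small illustration of the Levenshtein distance mentioned above, the sketch below uses `adist()` from base R (package utils) to match free-text answers to a reference list. The answers and the reference values are hypothetical examples, and real survey coding would typically need more than edit distance.

```r
# Hypothetical free-text answers to an open-ended survey question.
answers   <- c("Maseru", "Maseru ", "maseru", "Masaru", "Leribe")
reference <- c("Maseru", "Leribe")

# adist() returns a matrix of generalized Levenshtein (edit) distances:
# one row per answer, one column per reference label.
d <- adist(answers, reference, ignore.case = TRUE)

# For each answer, keep the closest reference label and its distance.
best    <- apply(d, 1, min)
matched <- reference[apply(d, 1, which.min)]

print(data.frame(answer = answers, matched = matched, distance = best))
```

A distance of 0 means an exact match (ignoring case); small distances usually indicate typos or stray spaces.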

Data diversification (4/4)

Images, sound and video

  • Computer vision and signal processing are now part of the statistical toolkit (e.g., analyzing satellite imagery for agricultural yields).

Emergence of new players

  • Traditional actors:
    • National Statistical Institutes (Insee, BoS Lesotho) and line ministries;
    • Central administrations (Tax authorities, Digital agencies) or mapping agencies (e.g., IGN);
    • More details to follow.
  • Crowdsourced and collaborative projects:
    • OpenStreetMap Lesotho (essential for local infrastructure mapping);
    • Wikidata, OpenFoodFacts

Emergence of new players

  • Private sector:
    • Vast datasets collected from users and customers (Big Data);
    • New opportunities for data-sharing partnerships (e.g., for research or public health);
    • The challenge: How to integrate these non-traditional sources into official statistics? (e.g., UNECE)

The democratization of data

  • The rise of Open Data and Open Source:
    • Global momentum for government transparency (starting in the late 2000s).
    • Rapid growth of national and international open data portals (World Bank, UN, National Platforms).
  • Technological and cultural shifts:
    • Widespread adoption of open, standardized formats.
    • Mass adoption of open-source programming languages (especially Python and R).
    • Increasing use of APIs for direct data retrieval.

See European Union website

“Open (Government) Data refers to the information collected, produced or paid for by the public bodies (also referred to as Public Sector Information) and made freely available for re-use for any purpose. The licence will specify the terms of use. These principles for Open Data are described in detail in the Open Definition.”

Data is everywhere

Data sources and ecosystems

National Statistical Institutes (BoS & Insee)

  • Core missions: Both the Bureau of Statistics (Lesotho) and Insee (France) share the same fundamental goals:
    • Collecting and analyzing vital data: Census, GDP, Inflation (CPI), and Labor Force Surveys.
  • Specific feature: Insee also has a strong mandate for social and economic research to inform public debate.
  • The R Ecosystem:
    • Many NSI datasets are now directly accessible through R packages.
    • This ensures reproducibility: your analysis can be updated instantly when the BoS or Insee releases new data.

National Open Data Portals

  • The Hub: A single platform centralizing data from all public sectors (Health, Transport, Local districts).
  • Transparency: Allows citizens and researchers to find and reuse raw data.
  • The R advantage: Instead of manual downloads, we use R to query these portals directly, ensuring our analysis stays up-to-date.
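
The idea of loading data from a portal in code rather than by hand can be sketched as follows. This is only an illustration: `load_open_data()` is a hypothetical helper, the URL in the comment is a placeholder rather than a real portal, and the CPI figures are invented. A temporary local file stands in for the download so the example runs offline.

```r
# Hypothetical helper: read.csv() accepts a local path or an http(s) URL,
# so the same function works for a file on disk and for a portal export.
load_open_data <- function(source) {
  read.csv(source, stringsAsFactors = FALSE)
}

# Offline stand-in for a portal download (values are made up).
tmp <- tempfile(fileext = ".csv")
writeLines(c("year,cpi", "2022,100.0", "2023,106.5"), tmp)

prices <- load_open_data(tmp)
# With a real portal, you would pass its CSV export URL instead, e.g.
# prices <- load_open_data("https://example.org/path/to/cpi.csv")  # placeholder URL

# Year-on-year change in percent; rerunning the script picks up new rows.
prices$change_pct <- c(NA, diff(prices$cpi) / head(prices$cpi, -1) * 100)
print(prices)
```

Because the retrieval step lives in the script, updating the analysis after a new release amounts to running the script again.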

Why does this matter for this training?

In the past, we had to “ask” for data. Today, we access it programmatically. Whether you are at BoS Lesotho, Insee, or a mapping agency, R is the bridge between these portals and your analysis.

GitHub: Where the code lives

  • Collaborative platform: A global hub for sharing and hosting code.
  • Much more than just code:
    • Project documentation and tutorials;
    • Professional websites and dashboards (like the ones we can build with R).
  • The home of Open Source and Reproducible Research:
    • Where NSIs (like Insee, Stats NZ, or UK ONS) share their methodology;
    • Allows for transparent and verifiable statistical production.

Why use R?

The principle of an open source language

General principle

Illustration with R

What is R?

  • An open source statistical software:
    • Core language for base operations
    • Packages to extend functionalities
  • Widely adopted across academia and public administrations
  • Extensive online support and resources

Note

  • Created in the 1990s;
  • Massive growth since 2010 (rising alongside Python).
  • RStudio: the next-generation IDE that makes R accessible and powerful for data science.

A “Swiss Army knife” tool

  • Handling all types of data;
  • Data visualization (dataviz), mapping and GIS;
  • Modeling (machine learning, network analysis…)
  • Writing reports, websites, and slides (like these ones 🤓)…

You can do everything in R:

Excerpt from R for data science (the bible)

Transparency and reproducibility

  • Traceability of statistics and graphical outputs.
  • Sharing R code ensures methodological transparency:
    • More and more journals now require code submissions!
    • Still some progress to be made in the field.
  • Using R Markdown (or Quarto) increases efficiency 🐢🔜🐇:
    • Eliminates messy intermediate files (text, excel, images…).
    • Saves time on formatting, since tables and figures are regenerated automatically.

Note

See the dedicated course on best practices (Insee is highly involved in this topic).

A community of users

  • An open source software:
    • Free and collaborative.
  • Thousands of packages:
    • on CRAN (The Comprehensive R Archive Network).
    • on GitHub.
  • A community driven by open science ideals.
  • A bridge between disciplines: sociology, economics, biology, political science, etc.
  • The power of collaboration: R thrives because users build their own tools and share them with the world.

The RStudio Interface

The four main panes of RStudio

  • Source (Top Left): Your script editor. This is where you write and save your code.
  • Console (Bottom Left): Where the code actually runs. You can type commands here for quick tests.
  • Environment (Top Right): Shows your active data, variables, and history.
  • Output Panes (Bottom Right): Where you see your plots, files, and help pages.

Is RStudio really the next-generation IDE?

  • RStudio (the company) is now Posit: It changed its name to signal that it supports more than just R (Python, Julia, etc.).
  • The rise of VS Code: Many data scientists are moving toward VS Code, a universal editor that is becoming the industry standard.
  • Why change? VS Code is faster, supports multiple languages better, and has a massive ecosystem of extensions.

The take-away

Learn RStudio today because it’s the easiest for beginners. But keep an eye on VS Code: it is likely where the future of professional data science is heading.

Why move from RStudio to VS Code?

  • Polyglot: Work on R, Python, and SQL in the same window.
  • Performance: More lightweight and stable for very large projects.
  • Standardization: It’s the same tool used by software engineers worldwide.
  • Positron: Posit is even building a new IDE based on VS Code technology!

Getting started with SSP Cloud

What is the SSP Cloud?

  • A cloud-native playground for data science (powered by Onyxia)
  • High-performance servers with R and RStudio pre-installed
  • An open space to learn, experiment, and share code

Important: Data Security

This platform is for open data only. Do not upload any confidential or sensitive production data from the BoS. Keep your official datasets on your secured local servers.

Note

More details available in the SSP Cloud documentation

Why use the SSP Cloud?

  • No installation headache: no need to manage R, RStudio, or packages locally
  • Standardized environment: everyone uses the same versions for reproducible work
  • One-click launch: start a full workspace in seconds (using “Training” buttons)

Creating your account

  • Use your official @gov.ls address at datalab.sspcloud.fr
  • Username rules: No accents, no special characters, no punctuation.

Use a simple format like firstnamelastname (no dot, since punctuation is not allowed). For example, if you are Ntate Stunna, your username could be ntatestunna.
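
The username rules above can be sketched as a small R function. `make_username()` is a hypothetical helper written for this example, not something provided by the SSP Cloud, and the accent handling relies on `iconv()` transliteration, whose exact behaviour depends on the platform.

```r
# Hypothetical helper: turn a full name into a username with no accents,
# no special characters, and no punctuation.
make_username <- function(full_name) {
  x <- tolower(full_name)
  # Try to replace accented letters with ASCII equivalents
  # (transliteration results vary across operating systems).
  x <- iconv(x, from = "", to = "ASCII//TRANSLIT")
  # Drop everything that is not a lowercase letter or a digit.
  gsub("[^a-z0-9]", "", x)
}

make_username("Ntate Stunna")
# → "ntatestunna"
```

It is worth checking the result by eye for names with accents before using it to register.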

Launching an RStudio service

Quick Guide

  1. Click on Service Catalog in the left menu.
  2. Keep the default RStudio settings and launch.
  3. Retrieve your RStudio service password.
    • Alternative way to find your service password.
  4. Log in to your service.

Building a Data Community

We want YOU to lead the innovation

  • Create your space: Don’t just code alone. Start a Slack channel, a Teams group, or an internal forum to help each other.
  • Onyxia is a helper, not a cage: It is a facilitator based on open standards. Your code remains yours and stays portable.
  • Continuous Exchange: For those inspired to lead or teach others, we will offer one day of remote follow-up to exchange ideas, troubleshoot together, and help you kickstart your internal community.

Take the lead

The best way to learn is to teach others. Build your BoS community today and we will help you become the next generation of instructors.

Lab 1: Getting started with the language