Building a smart URL list system: Policy for URL prioritization
Maria Xynou
2020-04-15
To improve the monitoring of website censorship around the world, OONI
aims to create a smart URL list system, while ensuring, to the extent
possible, the safety of the URL lists themselves by running them through
the usual Citizen Lab URL review process. This will help
ensure smarter test target selection and by extension, it will enable us
– and the broader internet freedom community – to more effectively
monitor, analyze, and respond to cases of website censorship around the
world.
This document describes OONI’s policy for URL prioritization. The
goal of this policy is to determine the criteria based on which the
OONI Probe testing of certain types of
URLs will be prioritized over others. Through URL prioritization, OONI
aims to optimize the value of collected measurements, ensure regular
testing of the same URLs for consistency, ensure that the tested URLs
are relevant to OONI Probe users, and to improve the monitoring of
website censorship around the world.
Summary
Even though thousands of websites are measured by tens of thousands of
OONI Probe users in more than 200 countries every month, detecting the
blocking of websites and collecting enough measurements to confirm
blocking with confidence remains an ongoing challenge. Blocked URLs are
sometimes not tested frequently enough (or at all), limiting the
coverage of censorship events, rapid response, and relevant advocacy
efforts.
To solve this problem, OONI aims to create a system for “smarter” URL
testing. With the smart URL list system, the OONI Probe testing of certain categories of URLs
would be prioritized over others, in order to improve the monitoring of
website censorship around the world.
URLs will be prioritized for testing depending on whether they are of
public interest, whether their blocking could impact human rights, and
whether they fall under a category that is frequently blocked around the
world (particularly in correlation to political events).
Country-specific criteria may apply too on a case-by-case basis.
In every case, the smart URL list system will only prioritize URLs that
are already included in the Citizen Lab test lists and which have
therefore been reviewed by the community and vetted in terms of safety.
Background
Why measure website censorship
Website blocking remains an ongoing - and increasingly worsening -
problem, often affecting marginalized communities the most.
Hundreds of media websites and human rights websites are blocked in
countries like
Iran
and Egypt.
Independent media organizations in Egypt
report
that they are forced to shut down their operations entirely, as a result
of ongoing, persistent blocking of their websites.
Amid Venezuela’s economic and political crisis, numerous independent media websites have been blocked,
along with several blogs expressing political criticism.
Last year, Wikipedia was not only blocked in Venezuela,
but all language editions of Wikipedia are now blocked in China as well.
Last month, the Farsi language edition of Wikipedia was temporarily blocked in Iran.
Minority group sites remain blocked in numerous countries around the
world. LGBTQI sites are blocked in countries like
Indonesia,
Iran,
Ethiopia, and
Malaysia,
the sites of the Baluch and Hazara ethnic minorities are blocked in Pakistan,
while the sites of the Baha’i religious minority are blocked in Iran.
Meanwhile, the blocking of websites is increasingly becoming more
sophisticated around the world. Cuba, for example, used to primarily serve blank block pages, only blocking the HTTP version of websites. Now they censor access to
websites that support HTTPS by means of IP blocking. Venezuelan ISPs used
to primarily block sites by means of DNS tampering.
Now state-owned CANTV implements SNI-based filtering
as well.
All of the aforementioned cases have been detected through the use of
OONI Probe and reported based on OONI censorship measurement data. However, our
ability to effectively track and respond to the blocking of websites
(and other internet censorship events) around the world is still rather
limited.
Current OONI Probe limitations to URL testing
The OONI Probe mobile app (the most widely
adopted OONI Probe testing client) tests a random selection of URLs
taken from the
global
and
country-specific
(based on the country that the user is running OONI Probe from) Citizen
Lab test lists.
Due to bandwidth constraints, the default is that OONI Probe will only
measure as many URLs as it can connect to within 90 seconds (users can
extend the test runtime in the app settings, but this feature is not
widely used).
This inevitably means that OONI Probe URL testing presents the following
limitations:
Blocked URLs may not get tested at all in a given country, if they don’t happen
to be randomly selected as part of the default testing;
Blocked URLs may only get tested a few times in a given country,
often leading to inconclusive results (particularly if a block page is not served and if false positives emerge);
URLs may not get tested during a time-frame when they’re temporarily
blocked (i.e. URLs were tested before and/or after their blocking,
but not during);
Many URLs that are of less public interest and which have never been
blocked may randomly get selected and get tested more frequently,
while other URLs of greater public interest and which are censored
may not get tested as frequently (or at all);
The random testing of URLs limits the ability to track and evaluate
censorship changes over time (i.e. the blocking and unblocking of
URLs).
In summary, the random testing of URLs presents challenges to the
testing of blocked URLs (and potentially means that blocked URLs are
often missed), limiting the coverage of censorship events, rapid
response, and relevant advocacy efforts. It also limits the internet
freedom community’s ability to identify censorship trends and changes
over time, since URLs may not be tested consistently over time.
Smart URL list system
To solve this problem and improve the monitoring of website censorship
around the world, we aim to build a system for “smarter” URL testing.
Based on this new system, OONI Probe users would no longer test URLs
(included in the Citizen Lab test lists)
randomly. Rather, the testing of certain categories of URLs would be
prioritized over others, in order to improve the monitoring of website
censorship around the world.
Goals
The underlying goals and principles behind URL prioritization
involve:
Responding faster to emergent censorship events;
Expanding the breadth and granularity of global coverage of website
censorship;
Optimizing the value of collected measurements;
Ensuring the regular testing of the same URLs for consistency and to
support data analysis efforts;
Ensuring that the tested URLs are more relevant to OONI Probe users.
We will adjust URL priorities based on the above goals and URL
priorities will be transparent. We will openly display which URLs are
prioritized for testing and we will provide the internet freedom
community the option to offer suggestions.
In every case, the smart URL list system will only prioritize URLs that
are already included in the Citizen Lab test lists and which have
therefore been reviewed by the community and vetted in terms of safety.
Criteria for URL prioritization
As part of the smart URL list system, the testing of URLs will be
prioritized based on specified criteria. Some criteria will apply to all
OONI Probe users globally, while other criteria will differ from country
to country. Below we share the main criteria for each.
Global URL prioritization criteria
The testing of URLs by OONI Probe users globally will be prioritized
based on the following criteria:
Public interest. URLs that host content or offer services that
are of public interest will be prioritized. Whether a URL or
category of URLs is of “public interest” will be determined based
on whether the censorship of such information could have an impact
on the general public (because it relies on this information).
News media, for example, is generally considered to be of public interest,
which is why its testing will be prioritized.
Impact on human rights. Our goal is to defend human rights on
the internet. We will therefore prioritize the testing of human
rights sites and other sites whose potential blocking could have
an impact on human rights.
Frequently blocked around the world. Social media is an example
of online content that is frequently blocked in countries around the
world, particularly during political events, such as elections or
protests. We will prioritize the testing of URLs if they fall
under a category that has commonly been blocked around the world
(such as social media, news media, and VPNs), particularly in correlation to political events.
Country-specific URL prioritization criteria
The testing of URLs by OONI Probe users may differ from country to
country. In addition to the global URL prioritization criteria,
country-specific URL prioritization may apply too based on the following
criteria:
Reportedly blocked URLs. If a specific website or type of
content is known to be blocked or reportedly blocked in a country
(according to news articles, research reports, local accounts, or
other third party resources), its testing may be prioritized. This
may include a certain type of content (such as gambling) that is
illegal/banned in a specific country. We are cognizant of the
increased potential risk
associated with testing illegal content, and will therefore
evaluate whether the testing of such content should be prioritized
based on input from local communities and country experts.
Likelihood of being censored. If certain types of URLs are
likely to be blocked (now or in the future) due to their
provocative content, their testing may be prioritized in a
specific country. For example, this may include blogs and other
websites that express political criticism.
Correlation to political events and potential for censorship.
Over the years, we have observed a
strong correlation between political events and the spike in
censorship events around the world. We may therefore prioritize
the testing of certain types of websites if they are likely to get
blocked in correlation to specific political events. For example,
this could involve the prioritized testing of election watchdog
websites leading up to, during, and shortly after an election.
The above country-specific criteria require local knowledge and
expertise. They will therefore mainly be applied when and if we receive
relevant advice and recommendations from local experts.
Overall, we may revise the above criteria in the future, particularly
once the smart URL list system is rolled out and we have seen how it
works in practice. We may also make changes based on community feedback
and suggestions, and adjust URL priorities over time. Any future changes
to the URL prioritization criteria will be reflected through an update
to this policy.
URL prioritization
Citizen Lab test list categories
OONI Probe measures the URLs included in the Citizen Lab’s
global
and
country-specific
test lists. These URLs fall under 30 broad categories,
which range from news media and human rights, to more objectionable
categories, such as hate speech and pornography.
However, these categories don’t carry the same weight in terms of public
interest and the possibility of being censored. News media, for example,
is probably of greater public interest than URLs that fall under the
gaming category. In certain countries where LGBT rights are not
recognized, for example, the blocking of LGBT sites might be more
probable than the blocking of URLs that fall under the e-commerce
category.
Therefore, the new smart URL list system will implement backend logic
for prioritizing the testing of certain URL categories over others.
The prioritized testing of URL categories that are of greater public
interest is especially important for OONI Probe mobile app deployments,
as it makes it possible to save up on bandwidth by prioritizing the
testing of more relevant URLs.
Emergent censorship events
In response to emergent censorship events, the smart URL list system may
prioritize the testing of URLs that are reported (for example, by the
news media or local community members) to be blocked. However, this
prioritization will be limited to URLs that are already included in the
Citizen Lab test lists
(and have therefore been reviewed and vetted).
If, for example, popular social media platforms – such as facebook.com
and instagram.com
– are reportedly blocked in a certain country, the
smart URL list system would enable us to prioritize the testing of
facebook.com
and instagram.com
by OONI Probe users in that country.
Practically, this means that if you are an OONI Probe user in said
country, when you tap/click “Run” in the OONI Probe app (without
specifying the URLs that you’re testing), facebook.com
and instagram.com
would be amongst the first URLs you would test. The prioritization of
URLs may be re-adjusted once/if an emergent censorship event has ended
(for example, once access to facebook.com
has been unblocked in a
certain country).
OONI data analysis
With smart URL selection capabilities, we eventually aim to have the
ability to dynamically determine and adjust the testing targets based on
input from OONI data analysis.
Our new fast-path data processing pipeline automatically
analyzes and publishes OONI measurements from around the world in near
real-time. This analysis can help flag censorship changes, such as the
new blocking or unblocking of specific URLs. It can also flag the
presence of anomalies for URLs that are of public interest, signaling
the potential presence of past, emergent, or ongoing blocking. All this
information can potentially feed into our new smart URL list system, in
order to help inform which URLs we should prioritize testing for.
OONI Run is a platform that is used by OONI
Probe users around the world to coordinate the testing of specific URLs
– particularly leading up to and during political events (such as
elections or protests), and in response to emergent censorship events.
Many of these URLs may be interesting to test on an ongoing basis, but
may not already be included in the Citizen Lab test lists.
We therefore aim to mine OONI data to identify URLs that have been
tested by OONI Probe users independently, add those URLs to the Citizen
Lab test lists, and prioritize the testing of certain URLs if they meet
the relevant criteria (as discussed previously).
Push notifications to solicit testing
Ensuring the prioritized testing of URLs that are of public interest is
not enough. We often have limited testing coverage of URLs of interest,
limiting our confidence in ruling out false positives and
confirming censorship events (especially if block pages
are not served).
To increase testing coverage, we will add support for configuring push
notifications to solicit testing. We will also add support so that OONI
Probe mobile app users can receive push notifications and run
experiments. This will be particularly useful during emergent censorship
events when fast coordination of targeted URL testing is crucial.
Analysis and publication of measurements
As the OONI software ecosystem is designed to automatically publish all measurements that are sent to OONI
servers, the internet freedom community and the public at large will
benefit from the more sophisticated testing of the smart URL list
system. Members of the internet freedom community and anyone from the
public will be able to share feedback on which URLs should be
prioritized.
To ensure that the measurements are more actionable, we are developing
data analysis capabilities aimed at examining results from a
website-centric perspective. This involves data analysis and pipeline
work necessary for extracting website metrics, as well as adding data
export capabilities for website-related metrics.
Call to Action
Review URLs included in the Citizen Lab test lists
Which URLs are prioritized for OONI Probe testing depends on which URLs
are included in the Citizen Lab test lists.
This means that if certain URLs are blocked or otherwise interesting to
test, but they are not included in the relevant Citizen Lab test lists,
they will not be tested by OONI Probe users and relevant OONI data will likely not be available.
We therefore encourage URL contributions
to the Citizen Lab test lists.
Review URL categorizations
The URLs included in the Citizen Lab test lists are categorized based on
a set of 30 categories,
and the OONI smart list system will prioritize testing based on these
categories.
This emphasizes the need to ensure that URLs in the Citizen Lab test lists are
categorized as accurately as possible. Your help in reviewing URL categories in the Citizen Lab test lists
(and changing any inaccurate categorizations) would be greatly
appreciated!