import pandas as pd
import numpy as np
import functools
from matplotlib import pylab as plt
%matplotlib inline
!tshark -2 -r mads-141.0.174.38.pcap -Tfields \
-eframe.time_epoch \
-eip.id -e ip.ttl -e ip.dst \
-etcp.srcport -etcp.dstport -etcp.len -etcp.stream -etcp.seq -etcp.ack -etcp.flags.str -etcp.window_size \
-etcp.options.timestamp.tsval -etcp.options.timestamp.tsecr \
-etcp.analysis.initial_rtt -etcp.analysis.ack_rtt \
-ehttp.host -ehttp.response.code -ehttp.user_agent -ehttp.location \
> mads-141.0.174.38.txt
data = pd.read_csv('./mads-141.0.174.38.txt', delimiter='\t',
names=
('time_epoch id ttl dst srcport dstport len stream seq ack flags window_size '
'tsval tsecr initial_rtt ack_rtt host code user_agent location').split())
data['intid'] = data.id.apply(functools.partial(int, base=16))
data.time_epoch -= data.time_epoch.min()
data.sample(1).T
len(data.stream.unique())
data.host.value_counts()
The www.xnxx.com stream is not really interesting, as it's just a single web page that happened to end up in the dataset.
data.location.value_counts()
There are three different sorts of streams in the dataset. Good streams get the correct redirect, bad streams get an injected redirect, and ugly streams get no redirect at all.
s_good = set(data[data.location == 'http://www.xnxx.com/'].stream)
s_bad = set(data[data.location == 'http://marketing-sv.com/mads.html'].stream)
s_ugly = set(data[data.host == 'xnxx.com'].stream) - set(data[~data.location.isnull()].stream)
print('Bad: %s' % sorted(s_bad))
print('Ugly: %s' % sorted(s_ugly))
data[(data.stream.isin(s_good)) & (~data.code.isnull())].ack_rtt.hist(color='green', density=True)
data[(data.stream.isin(s_bad)) & (~data.code.isnull())].ack_rtt.hist(color='red', density=True)
plt.xlabel('RTT, s'); plt.ylabel('Density');
Few streams fall outside the 0–250ms range, so let's look at higher-resolution histograms of that range.
data[(data.stream.isin(s_good)) & (~data.code.isnull())].ack_rtt.hist(bins=np.linspace(0, 0.25, 25), color='green')
plt.xlabel('RTT, s'); plt.ylabel('Count');
data[(data.stream.isin(s_bad)) & (~data.code.isnull())].ack_rtt.hist(bins=np.linspace(0, 0.25, 25), color='red')
plt.xlabel('RTT, s'); plt.ylabel('Count');
We have only six samples of injected redirects, but five of them have a significantly lower RTT than a usual HTTP response. The ~45ms RTT corresponds well to the latency of the last-mile ADSL link that was used during this analysis, so it is close to the time needed to get a packet from the ISP's network.
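One way to put a number on this comparison is the median RTT per stream class. The sketch below uses only the standard library; the helper name `median_rtt_by_class` and the sample values are made up for illustration and are not taken from the capture.

```python
from statistics import median


def median_rtt_by_class(samples):
    """Group (label, rtt_seconds) pairs by label and return each group's median RTT."""
    grouped = {}
    for label, rtt in samples:
        grouped.setdefault(label, []).append(rtt)
    return {label: median(rtts) for label, rtts in grouped.items()}


# Illustrative values only (NOT from the pcap): three "good" and three "bad" RTTs.
samples = [('good', 0.110), ('good', 0.120), ('good', 0.130),
           ('bad', 0.044), ('bad', 0.046), ('bad', 0.200)]
print(median_rtt_by_class(samples))
```

The median is used rather than the mean so that a single slow outlier among a handful of injected samples does not dominate the result.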
d_good = data[data.stream.isin(s_good) & (data.srcport == 80)]
d_good[d_good.tsval.isnull()].shape
So, good data from the server always has a TCP timestamp: there are no rows in the slice.
d_good.ttl.value_counts()
So, good data from the server always has TTL=54.
Let's look at the IP ID (Identification) field:
dsa = d_good[d_good.flags == '*******A**S*']
plt.scatter(dsa.time_epoch, dsa.intid, marker='.')
plt.xlabel('Time since 1st packet, s'); plt.ylabel('IP ID')
plt.title('IP ID for SYN-ACK packets from server');
dsa = d_good[d_good.flags != '*******A**S*']
fig = plt.figure(); fig.set_figwidth(15); fig.set_figheight(3)
ax = fig.add_subplot(1, 2, 1)
ax.scatter(dsa.time_epoch, dsa.intid, marker='.')
ax.set_xlabel('Time since 1st packet, s'); ax.set_ylabel('IP frag. ID')
ax.set_title('IP ID for non-SYN-ACK packets from server');
ax = fig.add_subplot(1, 2, 2)
dsa.intid.hist(bins=32, ax=ax)
ax.set_xlim(0, 2**16)
ax.set_xlabel('IP frag. ID'); ax.set_ylabel('Packets')
ax.set_title('IP ID for non-SYN-ACK packets from server');
print('Min/Max IP ID observed for non-SYN-ACK packets: %d %d' % (dsa.intid.min(), dsa.intid.max()))
The good server replies with IP-ID=0 in SYN-ACK packets and almost never uses IP-ID=0 in other packets, where the IP ID looks rather random.
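The same check can be expressed without pandas. This is a minimal sketch, assuming `ip.id` is exported by tshark as a hex string and that the flag string contains the letters `S` and `A` for SYN-ACK, as in the filters above. The helper names and sample rows are hypothetical.

```python
def parse_ip_id(field):
    """Convert a hex `ip.id` string such as '0x1a2b' to an integer."""
    return int(field, 16)


def ip_id_summary(packets):
    """Summarize IP IDs from (flags, ip_id_hex) pairs.

    Returns the set of IP IDs seen in SYN-ACK packets and the number of
    non-SYN-ACK packets whose IP ID is zero.
    """
    synack_ids = set()
    zero_in_others = 0
    for flags, ip_id in packets:
        value = parse_ip_id(ip_id)
        # A SYN-ACK has both the S and the A flag present in the flag string.
        if 'S' in flags and 'A' in flags:
            synack_ids.add(value)
        elif value == 0:
            zero_in_others += 1
    return synack_ids, zero_in_others


# Made-up sample rows, not taken from the capture.
sample = [
    ('*******A**S*', '0x0000'),
    ('*******A****', '0x1a2b'),
    ('*******AP***', '0xf00d'),
]
print(ip_id_summary(sample))
```

A packet that claims to come from this server but has a zero IP ID outside of a SYN-ACK would then stand out as a candidate for closer inspection.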
plt.scatter(d_good.time_epoch, d_good.window_size)
plt.xlabel('Time since 1st packet, s'); plt.ylabel('TCP Window, bytes');
plt.title('TCP Window announced by server');
The good server has a window size in the 25k…31k range (scaling is applied).
print sorted(s_bad)
d_bad = data[data.stream.isin(s_bad) & (data.srcport == 80)]
d_bad['time_epoch stream id ttl len seq ack flags window_size tsval ack_rtt'.split()].sort_values(by=['stream', 'time_epoch'])
So, all packets that look-like-injected have:
The server also sends 408 Request timeout after 120 seconds. It means that the server has not seen the request at all, so the injector acts as an in-band device.
Also, the ACK that confirms the FIN-ACK looks injected according to its IP ID and TTL, but it has a weird RTT (~98ms rather than ~44ms).
That's an injected stream that has ~200ms latency. On the other hand, the genuine SYN-ACK from the server also has a larger-than-usual RTT, so it's probably just a temporary Bufferbloat lag.
d_bad[d_bad.stream == 287]['time_epoch stream id ttl len seq ack flags window_size tsval ack_rtt'.split()]
The interesting thing about this stream is that the ACK confirming the FIN-ACK has an ACK_RTT of 18ms, which means that the packet was likely sent BEFORE seeing the FIN from the client, as the last-mile RTT is ~38ms according to mtr measurements.
If that is actually true, another question arises: why is the ACK confirming the FIN-ACK usually delayed by ~98ms? Is it triggered by some packet from the original server? Is it a sort of latency camouflage? No further research has been done yet to clarify these questions.
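The timing argument above boils down to one comparison: a reply that is triggered by a client packet cannot arrive sooner than the shortest round trip to any device on the path. A minimal sketch, where the function name is hypothetical and the constants are the figures quoted in the text:

```python
def sent_before_trigger(ack_rtt, min_path_rtt):
    """Return True if a reply arrived sooner than the minimal round trip allows.

    Both arguments are in seconds. A True result suggests the reply was
    emitted before the sender could have seen the packet it acknowledges.
    """
    return ack_rtt < min_path_rtt


LAST_MILE_RTT = 0.038  # seconds, the mtr figure mentioned in the text

print(sent_before_trigger(0.018, LAST_MILE_RTT))  # the 18ms ACK in stream 287
print(sent_before_trigger(0.098, LAST_MILE_RTT))  # the usual ~98ms ACK
```

This only holds as long as the mtr measurement is a fair lower bound for the last-mile round trip at the time of the capture.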
d_ugly = data[data.stream.isin(s_ugly) & (data.srcport == 80)]
d_ugly.groupby(by='stream tsecr'.split()).time_epoch.agg(['count'])
It means that the remote server has seen the SYN packet and the first ACK after the SYN, but has never seen the request itself. It suggests that the ugly streams are just bad streams that got no redirection packet for some reason.
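The reasoning relies on the TCP timestamp option: the TSecr value in a server packet echoes a TSval that the server received from the client. Matching the server's TSecr values against the client's TSval values therefore hints at which client packets reached the server. The sketch below illustrates the idea with hypothetical timestamps. The function name and data are assumptions, not values from the capture.

```python
def client_packets_seen(client_packets, server_tsecrs):
    """Return descriptions of client packets whose TSval was echoed by the server.

    client_packets: list of (description, tsval) tuples in send order.
    server_tsecrs: iterable of TSecr values taken from server packets.
    """
    echoed = set(server_tsecrs)
    return [desc for desc, tsval in client_packets if tsval in echoed]


# Hypothetical timestamps for one stream, not taken from the capture.
client = [('SYN', 1000), ('ACK', 1011), ('GET request', 1012)]
server = [1000, 1011, 1011]  # TSecr of the SYN-ACK and of later server packets

print(client_packets_seen(client, server))
```

Two client packets sent within the same timestamp tick share a TSval, so this matching cannot tell them apart.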
It's interesting that only mobile User-Agents were redirected to …/mads.html. Our test sent ~33% of requests with a mobile User-Agent and ~67% with a desktop User-Agent.
d_goo = data[data.stream.isin(s_bad | s_ugly)]
d_goo.user_agent.value_counts()
It explains why the OONI dataset sees no redirection. We've seen redirections only for mobile User-Agents, so the DPI probably targets mobile users.
# Streams whose request carried a mobile User-Agent (rows without a User-Agent are skipped).
mobile_ua = '.*(?:Android|RIM|Symbian|Series60|iPhone|BlackBerry|MIDP)'
s_mobile = set(data[data.user_agent.str.match(mobile_ua, na=False)].stream)
print('Redirection happens in %.1f%% of cases' % (100.*len(s_bad|s_ugly) / len(set(data.stream))))
print('Redirection happens in %.1f%% of mobile cases' % (100.*len(s_bad|s_ugly) / len(s_mobile)))