Fraud detection, micro-clusters and scatterplots

NO LABELS	OUR LABELS	GROUND ’TRUTH’
	(diamonds/crosses)	(diamonds/crosses)


(a) micro-cluster	(b) our guesses	(c) ground-truth
in-degree vs in-weight	core-number vs IAT	core-number vs IAT



in- vs out- degree

(d) raw data	(e) w/ micro-cluster(s)	(f) ground-truth

3.1) Raw data: Figure 1 (a) and (d)

The first column shows two of the many scatter-plots we are proposing: The first row ’(a)’ plots the in-degree (total # of friends calling in) versus weighted in-degree, in log-log scales since we have power-laws as we expected. Like all other scatter plots, this is a heatmap - notice that even the scales of the heatmap are logarithmic: The vast majority of subscribers, in red/orange colors, receive a few phonecalls ( ≈ 10¹), and the total talking time is relatively short ( ≈ 10²). Not surprisingly, there is a correlation: the more people call you, the larger the total duration of the phone calls. What is surprising is the micro-cluster inside the white circle, at (10², 10⁴): Many people ( ≈ 10³ - light-yellow colormap) have surprisingly similar in-degree of about 100, and similar total duration of about 10,000 seconds.

Let’s dissect (d), the scatter-plot below it. Every dot is a subscriber; the axis are the in- and out-degree of each subscriber, again in logarithmic scales (log (x+1), to be exact). There is heavy overplotting (red/orange/yellow colors) for small values - that is, most subscribers receive and initiate only a few phonecalls. There are two un-expected issues: The first is that there is little reciprocity: analysis of earlier who-calls-whom networks [1] reported that usually people who make x phonecalls, also receive about x phonecalls. Thus, we would expect to see a strong diagonal along the 45 degree line - but it is not there.

The second observation is that there are a lot of people who never return a phonecall (all the dots on the horizontal axis - our domain experts call them ’black holes’), and similarly many people that receive no incoming phonecalls (’volcanoes’). These behaviors are not necessarily bad: for example, an emergency room or a help desk would behave like a ’black hole’. However, several fraudulent behaviors are so asymmetric, like, eg., scammers (’you owe taxes’).

3.2) Our Guesses (b), (e)

The second column of Figure 1 shows our guesses for fraudsters (orange diamonds and crosses, for ’volcanoes’ and ’black-holes’ respectively). The crucial plot is (b), which shows the ’core-number’ versus inter-arrival time (IAT). The core-number has a complicated definition, ¹ but the intuition is that nodes with high core number belong to a densely connected community.

In Figure 1(b) notice that, while the vast majority of nodes (red/yellow dots at the bottom-middle of the graph) have low core-number (2-5), there are several with very high core number (≈20). Plotting them on the in- vs out-degree plot (’(e)’), about half of them are ’volcanoes’ (diamonds on the horizontal axis) and the rest are ’black-holes’ (crosses, on the vertical one). We shall refer to them as leads from now on.

It turns out that most of our leads actually call each other, forming a very dense bi-partite core; such subgraphs are typically fraudulent in social networks (like Twitter [3] , FaceBook [2] )

3.3) Comparison with the ’ground truth’

Our dataset already had labels, and we show them in the last column of Figure 1(c) and (f). Notice that (c) and (f) are exactly parallel to the middel column ((b) and (e)), with the only difference that the ’ground truth’ column has the labeled nodes in red. As in the ’leads’ case, diamonds and crosses correspond to ’volcanoes’ and ’black-holes’ respectively.

This is exactly the reason that we call the labeled set as ’ground truth’ within quotes: Several of the fraudsters may be flying below the radar, as we showed with our ’leads’ of Figure 1(b) and (e).

4) Conclusions

Citations

Formally, the k-core of a graph is a maximal subgraph that contains nodes of degree k or more; the core-number of a node is the largest value of k of a k-core that contains that node.↩︎

0) Acknowledgements

1) Reminders - Problem definition and past insights

2) New Insights

3) Deep dive

3.1) Raw data: Figure 1 (a) and (d)

3.2) Our Guesses (b), (e)

3.3) Comparison with the ’ground truth’

4) Conclusions

Citations