Median
We filtered our brightness graph and got something like this.
We would like to know around what kind of average these points are fluctuating. For that we are calculating the median.
Calculation
Let's say we have a sequence of values \(y_{0}, y_{1}, \ldots, y_{n-1}\). The median of these numbers is defined as follows.
- Sort the numbers into a sequence \(z_{0}, z_{1}, \dots, z_{n-1}\).
- From this sorted sequence, pick the middle number. If there is no middle, take the average of the middle two.
Lets work out an example. Take a look at the following example
\[ 31, 41, 59, 26, 53, 58, 97, 93, 23, 84 \]
If we sort this sequence we get
\[ 23, 26, 31, 41, 53, 58, 59, 84, 93, 97 \]
Because there are an even number of values, we should take the average of the of the two middle values. The average of \(53\) and \(58\) is \(\frac{53 + 58}{2} = \frac{111}{2} \approx 55.5\).
Make a library.
Because we are going to use the median several times, we are going to create a
library. Let's start with our lib.rs
.
In our lib.rs
we are announcing a module called median
.
# #![allow(unused_variables)] #fn main() { pub mod median; #}
There are different ways to create this module. Either creating a median.rs
file inside the src
directory. Or creating a median
directory inside the
src
directory, which contains a mod.rs
file. Which every you choose, let's
implement a median_of
function.
Our median_of
function will have a &Vec<f64>
as parameter and return the
median f64
. Once we have a sorted copy of the data called copy
, getting the
median comes down to determining if the number of elements is even or odd, and
performing the right calculation.
# #![allow(unused_variables)] #fn main() { let n = data.len(); let middle: usize = n / 2; let median = if n % 2 == 1 { copy[middle] } else { (copy[middle] + copy[middle - 1]) / 2.0; } #}
But how do we sort the original data?
Sorting side-quest
There are a few interesting tidbits when sorting a Vec<f64>
that we are going
to make a side-quest out of it. While looking into Vec<T>
documentation,
you can come across the method sort
. Let's see if we can use it.
# #![allow(unused_variables)] #fn main() { let mut vs: Vec<f64> = vec!(3.0, 2.0, 1.0); vs.sort(); #}
Unfortunately this doesn't compile.
Compiling playground v0.0.1 (file:///playground)
error[E0277]: the trait bound `f64: std::cmp::Ord` is not satisfied
--> src/main.rs:6:4
|
6 | vs.sort();
| ^^^^ the trait `std::cmp::Ord` is not implemented for `f64`
error: aborting due to previous error
error: Could not compile `playground`.
To learn more, run the command again with --verbose.
Which could come as a surprise. The Ord
trait determines an ordering of
elements. Certainly we can determine whether `0.0 < 1.0``?
# #![allow(unused_variables)] #fn main() { assert!(0.0f64 < 1.0f64); #}
Luckily we can. So what is going on? Rust has two related traits for comparison:
PartialOrd
and
Ord
. The main difference
is that Ord
is supposed to be total. I.e. any type that implements the Ord
trait should be able to compare any pair of values that have the type.
In other words, if you implement the Ord
trait you should be able to answer
yes to one and only one of the following questions with for values a
and
b
in the type:
- Is
a < b
? - Is
a == b
? - Is
a > b
?
The problem with f64
is that is implements IEEE-754, the standard for
arithmetic with floating point numbers. This standard defines a value NaN
,
not a number, which is not comparable with any other value.
So f64
can not be complete and follow the standard at the same time.
Fortunately PartialOrd
is implemented for f64
. So as long as we do not
compare with NaN
s, which we don't intend to do, we should be safe.
Back to sorting, the sort
method expects that the Ord
is implemented, so we
can not use it. Vec<T>
also has a sort_by
method, that allows to pass a compare
function. We can use this to our advantage by relying on the PartialOrd
trait.
# #![allow(unused_variables)] #fn main() { let mut vs: Vec<f64> = vec!(3.0, 2.0, 1.0); vs.sort_by(|a, b| a.partial_cmp(b).unwrap()); println!("{:?}", vs); #}
This correctly sorts our vector. But notice that the vs
variable is declared
mutable. Our signature doesn't expect to have a mutable reference, so we need to
copy our data
first.
Copying Data
We need a mutable copy of our data
. Luckily the Vec<T>
API provides an other
method; copy_from_slice
. We use it as
# #![allow(unused_variables)] #fn main() { let n = data.len(); let mut copy = vec!(0f64; n); copy.copy_from_slice(&data); #}
This is the final piece in the median puzzle. We are able to put everything
together and write our median_of
function.
Form Groups
We do not want to calculate the median of our entire sequence. Instead we want to move a sliding window over our data and calculate the median of that specific window.
For that we need to group our data. Let's create that function.
# #![allow(unused_variables)] #fn main() { fn groups(data: &Vec<f64>, group_size: usize) -> Vec<Vec<f64>> { let mut groups: Vec<Vec<f64>> = vec!(); for end_index in group_size .. data.len() + 1 { let mut group: Vec<f64> = vec!(); for index in (end_index - group_size) .. end_index { group.push(data[index]) } groups.push(group) } groups } #}
Median Filter
We are now in the position to create a median_filter
function. I.e. a function
that calculates the median of a sliding window over our data. With all of our
preparations it writes itself as
# #![allow(unused_variables)] #fn main() { pub fn median_filter(data: &Vec<f64>, window: usize) -> Vec<f64> { groups(data, window) .iter() .map(median_of) .collect() } #}
With our library all done, we can start out processing proper.
Processing
But wait! Our data arrives as f64
-pairs, i.e. (f64, f64)
, and we create
median_filter
to operate on a single f64
value. Did I lead you down a wrong
path?
Not entirely. Once again the standard library, in the form of the Iter
trait,
has a trick up their sleeve. It comes in the pair of methods zip
and unzip
.
You can find their signatures below.
With unzip
you can take a sequences of pairs and return a pair of sequences.
zip
goes the other way.
Let's see how we can use them. After getting the raw data, we can use unzip
to
extract the individual components.
# #![allow(unused_variables)] #fn main() { let (times, values): (Vec<f64>, Vec<f64>) = raw .iter() .cloned() .unzip(); #}
The cloned
call is because we need to take ownership of our data. Next we can
use our median_filter
from our own library. Make sure to reference our own
external crate and import the correct function.
# #![allow(unused_variables)] #fn main() { let median_times = median_filter(×, window_size); let median_values = median_filter(&values, window_size); #}
Finally we can zip together these two vectors again to get our result.
# #![allow(unused_variables)] #fn main() { let result = median_times.iter().zip(median_values); #}
Storing this into a CSV file makes it available for the next step.
Further Considerations
You have created a library that contains some functions. How do you know that they are implemented correctly? Try to add some tests that increases your confidence in your code.
The median_filter
accepts an window_size
argument. What is a good value?
Why haven't we used same the method we used to detrend the data?