Hello, this is a video guide to assignment 2.
This assignment is new in the on-demand version of this course, and
there wasn't a corresponding assignment in the previous version of the course.
This assignment corresponds to the work on content-based recommenders.
In our previous course we had an exercise where people used content-based
recommenders, but we felt it was more valuable to have an exercise
where you actually implemented some of the content profiles yourself,
which you can do either by hand or, as we would recommend, in a spreadsheet program.
In this video, I'll give you a brief introduction to the assignment and
show you some of the basics for
manipulating the data that we'll give you in a spreadsheet.
You have a set of documents and a set of topic content attributes.
And a data set describing which documents express which content attributes.
There's a spreadsheet file which I'm bringing up here that you
can open in your favorite spreadsheet.
It will work in Google spreadsheets, it will also work in Excel or
pretty much any other spreadsheet.
And I'm showing it to you with the data in that file.
So if you take a look down the left,
you see the names of these very informatively named documents 1-20.
And you'll see across the top ten attributes.
Whether the documents are about baseball, economics,
politics, Europe, Asia, soccer, war, security, shopping or family.
On the right, you have two content profiles.
These represent whether user one or user two saw and
liked, saw and didn't like or didn't see each of the 20 documents.
We'll come back to that in one second.
We also have near the bottom of the sheet a very simple summary,
DF, that counts the number of documents that
each of these concepts or terms was found in.
So by looking at DF we can quickly discover that Europe for
instance appears in 11 of our documents and is the most common concept.
While baseball, which appears in only four documents, is the least common concept in
these particular articles.
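For anyone who would rather check the DF row outside the spreadsheet, here is a minimal Python sketch of the same count. The attribute names and the 0/1 rows below are made-up toy data, not the assignment's sheet, and `document_frequency` is a hypothetical helper name.

```python
# Hypothetical toy data: each row is a document, each column an attribute.
# A 1 means the document is about that attribute. This is NOT the assignment data.
attributes = ["baseball", "economics", "politics"]
documents = {
    "doc1": [1, 0, 1],
    "doc2": [0, 1, 1],
    "doc3": [0, 0, 1],
    "doc4": [1, 1, 0],
}


def document_frequency(docs, attrs):
    """Count how many documents have a nonzero cell for each attribute."""
    return {
        attr: sum(1 for row in docs.values() if row[i])
        for i, attr in enumerate(attrs)
    }


df = document_frequency(documents, attributes)
print(df)
```

In this toy data, politics has the highest count, which is the same kind of reading you get from the DF row in the sheet.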
For your purpose you can think of these as news articles that might have been
shown to somebody using a tool like Google News.
One of them is the idea of being able to make a copy of an entire spreadsheet and
work with it again later.
So, if you click here in the top left (this works in Excel,
and it works just as well in Google) and copy (I'm doing that by hitting Control-C, but
you could also go to Edit, Copy), you can then
go to another sheet, come in there, and just say paste.
There's a one here that says he or
she disliked this article about family, so I'll subtract one.
That, simply put, is a dot product.
I'm going to multiply the two vectors.
If either one is zero (and for this purpose a blank is going to be zero), I'll add zero.
And I can do that using a function in this spreadsheet called SUMPRODUCT,
which will be useful for you in this assignment, but also in later assignments.
And any good spreadsheet will give you a little bit of help; it says,
give me a list of arrays, and it will calculate the dot product for you.
And so, I'm going to come here and say what I really want is this range.
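To make the dot product concrete, here is a small Python sketch of what a SUMPRODUCT-style calculation does for one attribute. It assumes ratings are coded as 1 for liked, -1 for disliked, and blank (here `None`) for not seen. The columns are invented for illustration, and `sumproduct` is a hypothetical stand-in, not the spreadsheet's own implementation.

```python
def sumproduct(xs, ys):
    """Dot product of two equal-length columns; None (a blank cell) counts as 0."""
    if len(xs) != len(ys):
        raise ValueError("columns must have the same length")
    return sum((x or 0) * (y or 0) for x, y in zip(xs, ys))


# Hypothetical columns for five documents (not the assignment's data).
family_column = [1, 0, 1, 0, 1]        # 1 = the document is about family
user_ratings = [1, None, -1, 1, 1]     # 1 liked, -1 disliked, None = not seen

# Liked family documents add one, disliked ones subtract one, the rest add zero.
family_taste = sumproduct(family_column, user_ratings)
print(family_taste)  # 1
```

Two likes and one dislike on family documents net out to 1; the fourth document is liked but is not about family, so it contributes nothing.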
There's one other trick in spreadsheets that you're going to find very useful.
If you've never used a spreadsheet before,
this is a trick that you'll find helpful all the time.
This cell is lovely, but it doesn't help you if you have to
use this formula over and over and you have to type it in over and over.
What we'd like to be able to do is copy this cell over here and
have it do the same thing but it's not quite going to work.
And the reason it's not going to work exactly right is that in
a spreadsheet when you move something left or right or up or
down it automatically adjusts these things left, right, up or down.
And many of the things we're comparing against are not
going to move automatically.
So think about what we want this cell to be when we ask the same question,
but instead of family, we want to look at shopping.
But I do want the dollar sign before this.
It's not going to change anything in this cell, but when I copy it and
paste it, suddenly I'm getting a 1 here.
And if I check this out, I'm going to see, well, no shopping, positive shopping,
no shopping, no shopping, no shopping, all of my shopping adds up to plus 1.
Now I'm going to leave it as an exercise for
you to do the same thing with user two.
But I will tell you that the dollar sign works not only here on
the letters, but
also on the numbers, if you wanted to say, look, I'm going to use other rows,
but I always want to use 2 through 21, because that's where my base data is.
Then I put dollar signs in there and that will work fine.
And then, we're going to ask you to figure out which articles are better or
worse, based on that taste profile.
And one simple way to do that would be to multiply
this row of tastes against each document row.
Well, you know how to do that.
That's a dot product.
You can use SUMPRODUCT.
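As a rough sketch of that scoring step, the Python below multiplies a taste profile against each document row and ranks the documents by the result. The profile weights, document rows, and the helper name `dot` are all hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def dot(xs, ys):
    """Plain dot product of two equal-length lists of numbers."""
    return sum(x * y for x, y in zip(xs, ys))


# Hypothetical taste profile over three attributes (not the assignment's data).
taste_profile = [2, -1, 0]

# Hypothetical document rows: 1 = the document has that attribute.
documents = {
    "doc1": [1, 0, 1],
    "doc2": [0, 1, 1],
    "doc3": [1, 1, 0],
}

# One score per document: the profile row times the document row.
scores = {name: dot(taste_profile, row) for name, row in documents.items()}

# Highest score first, i.e. the documents this profile should like best.
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranked)
```

Here doc1 scores 2, doc3 scores 1, and doc2 scores -1, so doc1 ranks first.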
And I'll see that there's a correlation of 0.21.
And that function is built in as well.
Just so that you see it at the top, that's correlation.
And you give it the two vectors you want to correlate.
They can be vertical, they can be horizontal.
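If you want to see what is behind that number, here is a minimal Python sketch, assuming the built-in spreadsheet function computes the standard Pearson correlation coefficient. The function name `pearson` and the sample vectors are illustrative only and do not reproduce the 0.21 from the sheet.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two vectors of the same length, at least 2 long")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical vectors: b rises with a, c falls as a rises.
a = [1, 2, 3, 4]
b = [2, 4, 6, 8]
c = [4, 3, 2, 1]
print(pearson(a, b))
print(pearson(a, c))
```

Whether the two vectors sit in a row or a column in the sheet makes no difference to the calculation; only the order of the paired values matters.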
Okay, back to the assignment.
You're going to be asked to do three things in this assignment.
And we're looking simply for correctness.
Along the way we'll give you some intermediate results to help you make sure
that you're doing things the correct way.
If you have questions about using spreadsheets, or
questions about the assignment that perhaps other students or we can
help you with, there will be a discussion thread attached to this video, and
we invite you to post your questions right in the discussion thread.
Normalization
Database normalization can principally be described as the practice of optimizing table structures. Optimization is achieved as a result of a thorough investigation of the various pieces of data that will be stored within the database. An evaluation of this data and its relationships is beneficial because it can result both in a considerable improvement in the speed at which the tables are queried and in a lower chance that the database's integrity will be impaired by tedious maintenance practices. In first normal form (1NF), all entities must have a unique identifier, or key, that can be composed of one or more attributes. In addition, all attributes must be atomic and non-repeating. Reaching first normal form consists of removing repeating groups. In this case the repeating groups are contacts and category. For each given customer, one or more contacts and one or more categories can occur. For each one of the repeating group I