Saturday, October 22, 2005

A Puzzle for Techies: Did 'Metadata' and 'Hash' Show that Files were Copied?

A puzzle for techies:

In an RIAA case a federal judge said that a screen shot of defendant John Doe Number 7's Kazaa shared files folder was enough to show copyright infringement because

"[RIAA] obtained "metadata" about the files that Doe No. 7 was disseminating, which often reveal who originally copied a particular sound recording from a CD to a computer disk (a process called "ripping") and provide a type of digital fingerprint, called a "hash," that can show whether two users obtained a file from the same source.... Using the metadata associated with the music filed that Doe No. 7 was offering for distribution on Kazaa, plaintiffs have determined that many sound recordings were ripped by different people using different brands of ripping software. Such information creates a strong inference that Doe No. 7 was not simply copying his or her own lawfully purchased CDs onto a computer, but had downloaded those files from other P2P users."

Was the judge right or wrong?

The case was Elektra v. Does 1-9

32 comments:

Anonymous said...

Over the last 5 years I have used 3 different operating systems, 2 different processors, and at least 6 different ripping programs to rip my music CDs (I have over 600 legally purchased CDs; I used to be a DJ in college). Additionally, I have downloaded music legally from eMusic.com, which has gone through at least 2 different ripping systems. I'd hate to see what my metadata and hashes look like. I would probably look like a thief. The data will show different OS and ripper information (in some cases), but not necessarily for illegal reasons.

Anonymous said...

Many ripping programs will attach metadata to audio files, though there are some (mostly open source) ones that will not. I'd hesitate to say that it is a true "digital fingerprint". It is possible that slight variations in the implementation of the encoding algorithm could have a similar effect, but I am not aware of any meaningful data that can be gathered that way.

It is most certain that there is no accurate way to determine who created the file. No useful identity gets maintained through transfers.

Further, it's not illegal to download a file from someone else if you already have a license for it, so some of those files could have been downloaded, but only because the user chose not to encode them himself or herself.

Anonymous said...

If two different copies of the same CD are ripped using the same cda-2-mp3 encoding algorithm at the same bit-rate, then the file's contents and consequently the hash will be exactly the same.

What this basically means is, you can produce identical files on two different machines from two different copies of the same CD. So the argument that a hash is universally unique is completely flawed. If the two files' contents match, then the hash will be the same, it's that simple.
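To make the point concrete, here is a minimal Python sketch. `toy_encode` is a hypothetical stand-in for a real encoder (it is not an MP3 encoder); the only property assumed is that it is deterministic, so the same audio and the same settings give the same bytes. SHA-256 is used purely for illustration and is not necessarily the hash any P2P program uses.

```python
import hashlib


def file_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(data).hexdigest()


def toy_encode(pcm: bytes, bitrate_kbps: int) -> bytes:
    """Hypothetical deterministic 'encoder': same input + settings -> same output."""
    header = f"TOYENC bitrate={bitrate_kbps}\n".encode("ascii")
    return header + pcm[::2]  # crude placeholder for lossy compression


# Two separately purchased copies of the same CD carry the same audio data.
cd_copy_one = bytes(range(256)) * 64
cd_copy_two = bytes(range(256)) * 64

rip_on_pc_a = toy_encode(cd_copy_one, 128)
rip_on_pc_b = toy_encode(cd_copy_two, 128)
rip_other_settings = toy_encode(cd_copy_one, 192)

print(file_hash(rip_on_pc_a) == file_hash(rip_on_pc_b))         # same settings
print(file_hash(rip_on_pc_a) == file_hash(rip_other_settings))  # different settings
```

The first comparison matches even though neither "rip" was copied from the other; the second differs only because a setting changed.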

When they say Metadata, I'm assuming they are talking about an mp3's ID3 Tag information.

This information can be changed by anyone extremely easily, for instance with Winamp. What this proves is that Metadata can be duplicated many times, in many different files; ergo it's not uniquely identifying.

Finally, the Metadata and Hashes can be reproduced from the same music albums/singles, which proves they're not unique. If they're not unique, they can't be used to differentiate between an original and a copy.

Besides this, there is no way to tell the difference between a perfect copy and an original, so how can they claim to tell the difference? I think we need to find out more information about the Metadata they are using.

A good example of this would be to buy two copies of the same single from the same music store and rip each one on a different PC using the same encoding software, for example Winamp (with the same settings: bit-rate, etc.).

The file contents should be identical in theory, so if you hash them, they should produce the same values. And as I said, the Metadata contents can be normalized so they are identical.

Now, not trying to appear arrogant, but how exactly is this judge competent enough technically to decide if the RIAA's argument is valid? Surely the court should appoint one or more independent technical advisers to assess the validity of the plaintiff's claim?

Anonymous said...

And even if by some miracle the two different copies from the same store produce slightly different hashes, there is still no way to determine whose copy is the original and whose is the copy just by analysing the hash and metadata. Which, as I said, would be a perfect copy.

Anonymous said...

How on earth can a screenshot be used as evidence in copyright infringement?

I could create a set of blank files named as copyrighted material, stick them into a folder and take a screenshot, then claim they were being shared by somebody. Isn't there some kind of standard for quality of evidence?

It's a sad state of affairs if something so weak as a screenshot can be used to prove copyright infringement.

Anonymous said...

No, you cannot tell that files were copied just by viewing the metadata/hash of said files. A hash (such as MD5 and CRC) has a certain statistical chance of being duplicated, so it's possible for someone with an actual copy of "Britney.Spears.mp3" to have an actual recording, and another person with the same filename and hash to have a file filled with garbage (or something entirely different, like "Kenny.Rogers.Greatest.Hits.mp3").
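A small Python sketch of why any fixed-size checksum must have duplicates. To keep the search instant, it truncates CRC-32 to 16 bits (that truncation is my assumption for the demo, not something any P2P client is known to do), so 65,537 distinct inputs are guaranteed to contain two with the same value. Full-length hashes like MD5 have the same property in principle, just with collisions that are vastly rarer by chance.

```python
import zlib


def short_checksum(data: bytes) -> int:
    """CRC-32 truncated to 16 bits, so only 65,536 distinct values exist."""
    return zlib.crc32(data) & 0xFFFF


def find_collision() -> tuple[bytes, bytes]:
    """Return two different byte strings that share the same short checksum."""
    seen = {}
    # 65,537 distinct inputs into 65,536 possible values: a repeat is certain.
    for i in range(65537):
        data = f"file-{i}".encode("ascii")
        value = short_checksum(data)
        if value in seen:
            return seen[value], data
        seen[value] = data
    raise AssertionError("unreachable: the pigeonhole principle guarantees a repeat")


first, second = find_collision()
print(first, second, hex(short_checksum(first)))
```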

In fact, the RIAA should be acutely aware of how easy it is to make files that appear legitimate (have the correct hash and metadata) but are actually garbage because they routinely hire firms to flood P2P networks with trash files that appear to have a correct hash/metadata set, but are in fact garbage.

Metadata in the form of ID3 tags are also "spoofable" (by both the RIAA and consumers/end users). Like one time I was trying to download that Tommy Lee and Pamela Anderson video, and all I got was some hentai anime... :(

The only way they can actually prove infringement occurred would be if they downloaded the file from the accused file sharer, and then that wouldn't be infringement because a copyright holder (or their agents?) cannot infringe their own copyright.

Jonathan said...

Anonymous nailed it, above. If you start from the same source (an audio CD) and use the same "ripping" program with the same settings, the result should be identical and the hash should also be identical. It's a mechanical, deterministic process.

What's more, many such "ripping" programs come with default settings, and some programs (like "Exact Audio Copy" and the "LAME" mp3 encoder) have profiles available for them that will set all the "ripping" settings for you. If two people use the same settings on the same program with the same CD, they'll get identical output.

The short version: identical hashes in no way uniquely identify the source of a song.

It is possible to make a case that unusual metadata (e.g. a particular person's favorite moniker inserted manually in the mp3's "comments" field during the ripping process) is unlikely to have been independently entered by two different people, but you'd need to do more than simply assert it.

Jonathan said...

One more thing --

The metadata on songs is also generally unreliable as a marker of the source of a recording. Aside from the "unusual material" I noted above, some "ripping" programs will download the metadata for all songs on a CD from a standard library.

I mentioned Exact Audio Copy (EAC), above. There's a set of standard profiles for EAC called the "uberstandard", for example (see http://www.ubernet.org). If that's used, and if the person doing the copying uses the built-in music library access function, he or she will produce mp3's with standard output and standard metadata that will be identical in all respects to that produced by anyone else using the same features.

Anonymous said...

Some thoughts.

1. can a screenshot constitute evidence of anything? haven't they got anything better?

2. if 2 files have the same md5 hash you can pretty much bet they are identical. If the 2 files merely have the same mp3 metadata then you cannot be so sure.

3. hashes do not indicate the source of the file.

4. surely for the RIAA to show copyright infringement, they need to show that the defendant distributed a copy of the copyrighted work to someone who didn't have a license for it. How on earth can hashes or screenshots have any relevance for this?

I think that if they want to use old-fashioned copyright laws that predate the computer era then they need old-fashioned standards of evidence: i.e. proof that person A distributed file to person B (who is not an RIAA agent) and furthermore that person B did not have a license for the file (e.g. perhaps he owned the CD in question and was just downloading the mp3 for a backup).

Emil

Anonymous said...

Courts hear evidence from professionals when a case is technical in nature, like for example with medical evidence, so why is this case any different?

It's like saying judgements can be made in cases where there are technical medical arguments, without bringing in professionals to give evidence.

I can't see how a judge is equipped technically to make decisions of this nature without the technical aspects being evaluated by an independent.

Anonymous said...

I am assuming that the hash codes being talked about are of the entire mp3 file and not just the frames of the mp3 audio data itself.

If those hash codes are the same, there is a finite probability that it is the same file, however, as stated above, it is not impossible that two different rippers using exactly the same VERSION of the encoder will produce the same 'core' MP3 file.

The 'core' mp3 file (my definition) consists of just the frames of the encoded audio with no extraneous information included. The first frame of MP3 audio data typically includes a lot of 'metadata' regarding the encoding process. The name and version of the encoder is usually documented here. In many cases, that data is also written into the last frame, or several frames, of the 'core' MP3 file. The encoders are not the only programs that store metadata in that first frame. Others, like MP3Gain, store data relating to how a given file has been normalized such that it can be 'unnormalized' back to its original state. Other programs will 'trim' mp3 audio data frames to get rid of silence at the beginning/end of a file.

A higher level of metadata typically included in the MP3 file, and which surrounds the core MP3 file at the beginning and end, are the ID tags (there are two versions of tags, and both are typically present). There is a huge variance in the info that can be stored in those tags, including an image (typically of the cover artwork) if so desired. This is where the user can enter virtually any data about the audio that is desired.

Any one of these various changes will change the hash code (again, assuming that it is of the entire file, not just the audio data frames). I tag my files in a way that I've not seen elsewhere, so they would be somewhat unique. Typical users would not want their files tagged that way and, if they ever were to get hold of them through some nefarious means, would probably change them. The underlying mp3 audio data is the same; the hash code will not be.

Removing all of the tags would result in a 'core' mp3 file. At that level, same CD, same version of encoder, same settings, no add-on processing, etc. would theoretically produce the same core mp3 file with the same resultant hash. AFAIK, the OS has nothing to do with this. Having the same hash code neither proves nor disproves that the files are the same.

At the higher level, with the addition of all of the levels of permutations of labeling/processing, the same hash code would suggest there is a small probability that the files are the same. I would submit that this claim could neither be proven nor disproven, however.
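A rough Python sketch of the 'core' idea, under stated assumptions: it only strips a leading ID3v2 tag and a trailing ID3v1 tag, it does not touch any encoder information stored inside the first audio frame, and `core_frames` is made-up filler rather than real MP3 audio. The helper names are mine.

```python
import hashlib


def strip_id3(data: bytes) -> bytes:
    """Remove a leading ID3v2 tag and a trailing ID3v1 tag, if present."""
    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, 4-byte syncsafe size.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for byte in data[6:10]:
            size = (size << 7) | (byte & 0x7F)  # 7 useful bits per byte
        footer = 10 if data[5] & 0x10 else 0    # optional footer (ID3v2.4)
        data = data[10 + size + footer:]
    # ID3v1: the last 128 bytes of the file, starting with "TAG".
    if len(data) >= 128 and data[-128:-125] == b"TAG":
        data = data[:-128]
    return data


def make_id3v1(title: str) -> bytes:
    """Build a minimal ID3v1 tag: 'TAG' + 30-byte title, zero-padded to 128 bytes."""
    return (b"TAG" + title.encode("latin-1")[:30].ljust(30, b"\x00")).ljust(128, b"\x00")


core_frames = b"\xff\xfb\x90\x00" + bytes(400)  # placeholder for encoded audio

copy_a = core_frames + make_id3v1("Track One")
copy_b = core_frames + make_id3v1("ripped by me")

# Whole-file hashes differ because the tags differ...
print(hashlib.sha256(copy_a).hexdigest() == hashlib.sha256(copy_b).hexdigest())
# ...but the stripped 'core' bytes hash the same.
print(hashlib.sha256(strip_id3(copy_a)).hexdigest()
      == hashlib.sha256(strip_id3(copy_b)).hexdigest())
```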

raybeckerman said...

Is it fair to say that the consensus of opinion is that
(a) the judge should have heard technical evidence rather than take the RIAA's word for it, (b) the metadata could not have established who copied the song onto the disk drive, (c) the hashes could not have established that 2 different users obtained the file from the same source, (d) the RIAA could not have known from the metadata that John Doe was copying the files from other P2P users? I.e., are you saying the judge was wrong?

Anonymous said...

I would say that's a fair summation, Ray.

And yes, I would say the judge is not only wrong, but that he does not possess the technical evidence to make that decision.

Neither Hash nor Metadata information can give an absolute guarantee of the source, or of whether it's a copy or an original. That alone casts enough reasonable doubt on the evidence.

Anonymous said...

Let me point out that even if what the judge said isn't true in the general case (and it isn't), it could be true in a specific case. I'd have to look at exactly what metadata they saw, and what it consisted of. But it is easy to imagine a set of data that would indicate either multiple copying sources or someone deliberately trying to make it look that way. Depending on what the judge was deciding, this may be sufficient evidence for the purposes at hand (though I wouldn't expect it to be enough to convict someone who's putting up a defence).

--
-billy-

Jonathan said...

Fair warning: I'm an electronics engineer, not a computer scientist. I'm also not the person to call as an expert witness; you should use someone with better credentials than I. But from what I've read, the following answers are true. This is intended to point in the right direction.

(a) the judge should have heard technical evidence rather than take the RIAA's word for it

In my opinion, yes.

(b) the metadata could not have established who copied the song onto the disk drive

Absolutely not. There's nothing in the mp3 metadata that includes information on how a file got onto a hard drive, or into a shared folder. The most you might be able to say is when the file was copied there and when it was last accessed.

(c) the hashes could not have established that 2 different users obtained the file from the same source,

You can talk probabilities of whether a song recorded from a CD using unusual settings would be likely to be recorded that way more than once by different people with their own copies of the CD, but overall, your statement above is true.

If two people start off with store-bought copies of the same CD and both make recordings of it, it's very possible they'll end up with identical mp3 files and identical hashes.

(d) the RIAA could not have known from the metadata that John Doe was copying the files from other P2P users?

Again, you can talk probabilities, but it's true there's nothing definitive about the metadata or hashes that would establish with certainty that John Doe copied the files from other P2P users.

Maybe that's too nebulous; I'm an engineer, and love to give caveats. But from everything I've read, it's entirely possible (indeed, likely, depending upon how it was recorded -- there are just a few encoding systems out there that everyone uses) that a song could have been recorded identically by two different people.

Moreover, there's nothing in the mp3 file's metadata that would establish how the file got there, or by whose hand. Anyone can copy a file into a shared directory on a computer by multiple means (e.g. from a USB drive); there's nothing about a p2p program that prevents that.

Worse: mp3 metadata can be altered at will by using commonly available software.

The one thing that might be of help in determining how an mp3 got into a shared folder would be whether or not a filesharing program (e.g. Kazaa) keeps logs of what happens with it. This might be a means to establish where the file came from, if it had been downloaded.

By the way: here's a good reference to what's in the metadata:

http://www.searchtools.com/info/mp3-search.html

Anonymous said...

(a) the judge should have heard technical evidence rather than take the RIAA's word for it

Absolutely. The Judge erred when he didn't consult independent technology experts. I sincerely doubt the Judge understood all the technologies in use.

(b) the metadata could not have established who copied the song onto the disk drive,

Correct. The metadata does not identify any of that information. Further, an examination of the defendant's computer would only yield the date and time the file was created (which can be misleading), the date and time it was last accessed, and if the filesystem is NTFS, the user who created or "owns" the file. I cannot stress enough that this information (everything from "Further" onwards) is (generally) not available remotely to other P2P users. Obviously this depends on the P2P technology being used, but as far as I know all existing P2P technology keeps that information private.
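As a small illustration of why file dates can be misleading, here is a Python sketch that creates a throwaway file and then sets its access and modification times to an arbitrary value with the standard `os.utime` call. The file name and the chosen timestamp are made up for the example; creation time and ownership are handled differently across operating systems and are not shown.

```python
import os
import tempfile


def timestamps_after_backdating(new_time: float) -> tuple[float, float]:
    """Create a scratch file, overwrite its timestamps, and read them back."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "song.mp3")  # hypothetical file name
        with open(path, "wb") as handle:
            handle.write(b"not really audio")
        # Any user who can write the file can set these two values.
        os.utime(path, (new_time, new_time))
        info = os.stat(path)
        return info.st_atime, info.st_mtime


# 1_000_000_000 seconds after the Unix epoch falls in September 2001.
atime, mtime = timestamps_after_backdating(1_000_000_000)
print(atime, mtime)
```

Nothing in these values records where the bytes came from, only what the filesystem was last told.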

(c) the hashes could not have established that 2 different users obtained the file from the same source

Correct again.

(d) the RIAA could not have known from the metadata that John Doe was copying the files from other P2P users?

Also correct. The metadata is just that, meta information about the data in the file. Whether or not that information is accurate or trustworthy is another matter. And it certainly contains no information as to the source of the data, where (or if) it's been transferred, and so on. As far as I know nothing stores that kind of information.

I.e., are you saying the judge was wrong?

Absolutely, positively. And for what it's worth, I work in software, so I like to think I understand all this pretty well. ;)

Anonymous said...

Ray, Fatal Flaw again. Anyway, I just want to say that the judge had no business making that decision if he did not already possess the technical means and know-how to understand exactly how metadata and hash work. At the very least an expert should have been brought in to explain it to the judge, preferably an unbiased third party, so that the judge could assess the situation appropriately. If I were the RIAA and told the judge the giggaflop interfaces with the koolmo-d on aft port 795 when the snarflog hits 2 and thus clearly proves that Santa Claus murdered JFK, does that mean he should take my word for it without knowing EXACTLY what it meant? I tell ya, if I ever get arrested, I want this judge to oversee my case. I'll bullshit myself circles around him and get off quicker than O.J.

The metadata and hash mean absolutely nothing and should not even be admissible in court because not only are they not "tamper proof" but they cannot prove ANYTHING conclusively, and definitely should not be grounds for a conviction or even sway the decision in the general direction of a conviction.

Fatal_Flaw said...

I'll even go one step further and say that ANY judge making a decision this uninformed when people's lives (not life and death, but life in general) are on the line is a hazard to himself, those around him, and the judicial system, especially in this technological day and age. He should be forcefully removed from the bench and stripped of all titles immediately before anyone else appears before him.

raybeckerman said...

Thank you. This has been most informative. And it demonstrates the danger of a 1-sided lawsuit, where the judge doesn't get to hear from both sides.

Fatal_Flaw said...

Holy crap! You're up late, aren't ya? :)

Julian Bond said...

To take the simplest possible view, which may well be where the RIAA is coming from: let's say you're a dumb user who doesn't know much about MP3 ID tags and never changes them, and you downloaded lots of files via Kazaa. It's quite likely that your collection of files will have things like "A1 R1pping Cru3" in the comment tag, and they'll all be different. I would say that this was pretty good circumstantial evidence that you were what I described: "A dumb user, knowing nothing about MP3 tags, using Kazaa."

Anonymous said...

The thing is, Julian, even if this person is a dumb user, the fact that mp3-ID3-Tag-Information (Metadata) can be changed so readily by anyone means it's inherently unreliable as evidence.

It would be like using DNA evidence that only had a 5% chance of being correct.

Any kind of link you try to establish between Metadata and users is tenuous at best.

Jonathan said...

What's more, the presence of such files on your computer simply doesn't prove that the files were downloaded via Kazaa.

If you were talking about a "dumb user" with such files in his or her shared p2p directory, would I suspect that the files had been downloaded? Yep. But it wouldn't be too hard to paint alternate pictures of how the files got there.

Anonymous said...

The hashing algorithms are used by file sharing programs to ID a file uniquely with a short string of alphanumeric characters. Some hashing algorithms are better than others, and their functionality is measured by the possibility of two different files colliding (having the same hash ID).

The hashing algorithm employed by Kazaa (Fasttrack network) is broken. They use the UUHash algorithm. While UUHash allows very large files to be checksummed in a short time, even on slow computers, it also allows for massive corruption of a file to go unnoticed. In fact, the RIAA affiliates have exploited this weakness to flood the network with corrupted files (spoofs) that have the same hash as the original file. ID’ing a file with this algorithm is unreliable. You can read more about it here:

http://en.wikipedia.org/wiki/UUHash
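To show the kind of weakness described, here is a deliberately simplified Python sketch. It is NOT the UUHash algorithm; `sampled_hash` is a toy scheme I made up that hashes only the first and last few bytes plus the file length. The assumption it illustrates is just this: if a hash reads only parts of a file to save time, everything it skips can be replaced without the hash noticing.

```python
import hashlib

CHUNK = 4  # tiny sample size so the demo data stays readable


def sampled_hash(data: bytes, chunk: int = CHUNK) -> str:
    """Toy partial hash: covers the length, the first chunk and the last chunk only."""
    h = hashlib.sha256()
    h.update(len(data).to_bytes(8, "big"))
    h.update(data[:chunk])
    h.update(data[-chunk:])
    return h.hexdigest()


def full_hash(data: bytes) -> str:
    """Hash every byte of the file."""
    return hashlib.sha256(data).hexdigest()


genuine = b"HEAD" + b"real audio data " * 10 + b"TAIL"
# Same length, same first and last bytes, but the middle is replaced with zeros.
spoof = b"HEAD" + b"\x00" * (len(genuine) - 8) + b"TAIL"

print(sampled_hash(genuine) == sampled_hash(spoof))  # the partial hash is fooled
print(full_hash(genuine) == full_hash(spoof))        # the full hash is not
```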

The hash algorithm merely creates an ID for a file. It does not tell you if the file was uploaded or downloaded or how it was created. Two files that differ by only one bit should have very different hash ID’s. Thus, just changing one character in the metadata of an mp3 should result in a different ID. However, if the same software was used to rip two copies of the same CD with the same type of hardware and the same settings in the software, the two mp3’s should be identical and thus have the same hash ID. Thus, you couldn’t rightly say if the file was uploaded or downloaded. Also, if two files have the same hash, one could have been copied from the other by any means. They didn’t necessarily have to be uploaded or downloaded from the internet.
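A short Python sketch of the "differ by only one bit" point, using SHA-256 from the standard library as an example of a well-behaved hash (an assumption for illustration; it is not the hash Kazaa uses). The input bytes are made up and merely stand in for an mp3 file.

```python
import hashlib


def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bits that differ between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


original = b"pretend these bytes are an mp3 file" * 100

altered = bytearray(original)
altered[0] ^= 0x01  # flip a single bit in the first byte
altered = bytes(altered)

digest_original = hashlib.sha256(original).digest()
digest_altered = hashlib.sha256(altered).digest()

print(hamming_distance(original, altered))               # bits changed in the file
print(hamming_distance(digest_original, digest_altered))  # bits changed in the hash
print(digest_original.hex())
print(digest_altered.hex())
```

The two digests come out unrelated even though the inputs differ by a single bit, which is why editing one character of a tag gives a file a new ID.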

Metadata are tags stored into mp3s that identify the file’s artist, song title, etc. Sometimes, the software that rips a CD may store a tag into the file. Other times the ripper may put his/her mark on the file by adding a tag. These tags can be added/deleted by anyone so I don’t see how this can be used to ID a file. Metadata is certainly less secure than a hashing algorithm for identification purposes.

Jonathan said...

"The hashing algorithm employed by Kazaa (Fasttrack network) is broken. They use the UUHash algorithm. While UUHash allows very large files to be checksummed in a short time, even on slow computers, it also allows for massive corruption of a file to go unnoticed. In fact, the RIAA affiliates have exploited this weakness to flood the network with corrupted files (spoofs) that have the same hash as the original file. ID’ing a file with this algorithm is unreliable. You can read more about it here:"

Ray, this is a HUGE deal. What anonymous posted, above, completely invalidates the use of hashes to uniquely identify music files on Kazaa.

If the hashing algorithm is indeed broken, and if the file can, in fact, be very different while having the same hash (and worse: two people can deliberately create files with the same hash), then the hash is literally useless.

This isn't how digital hashing is supposed to work.

raybeckerman said...

This has been an incredible learning experience.

Make sure that those of you who can do join up with the Electronic Frontier Foundation's "cooperating techs" list.

https://falcon.eff.org/mailman/listinfo/effcooperatingtechs

Thanks to all of you for sharing your expertise and your insights.

Best regards,

Ray

Anonymous said...

The people who reverse engineered the UUHash algorithm also created the sig2dat extension that allows Kazaa links to be posted onto websites. These guys are truly experts on UUHash, and I'm sure if you contact them, they would be willing to help.

Here's their website:

http://www.geocities.com/vlaibb/tools.html

Anonymous said...

It's ironic how the RIAA's own tactics of flooding p2p networks with fake files invalidate their own arguments in this case.

Anonymous said...

Finally you are starting to think the right way.

Remember there are a few key facts here.

-computer files are just numbers. (OK, they are really, really big numbers, but hey)

-numbers can be represented in an infinite number of ways.

-as such, the same computer file can be represented in infinitely many different ways.

-and the hard one: one computer file also represents an infinite number of other files.
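A tiny Python sketch of the "files are just numbers" idea: the same bytes are shown as one big integer, as hexadecimal text and as a binary string, and each form converts back to the original bytes. The sample bytes are made up and are not a real mp3.

```python
data = b"ID3 not really an mp3"

as_number = int.from_bytes(data, "big")  # the whole file as one integer
as_hex = data.hex()                      # the same bytes written in base 16
as_binary = format(as_number, "b")       # the same integer written in base 2

# Every representation leads back to exactly the same bytes.
round_trip = as_number.to_bytes(len(data), "big")

print(as_number)
print(as_hex)
print(round_trip == data)
```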

In short, no hash, metadata, filesize or anything else you can think of can guarantee that a file is the same as another file, or even that it's the same as a contested copyrighted song.

If you tell a computer to display an mp3 as an image, it will (provided you sort out the problem of it throwing a fit at incorrect file headers). You can easily display an mp3 in a text editor.

So the question I now ask is: are that text and that image also copyrighted to the artist who wrote and performed the music?

Anonymous said...

If I have a dead body in my house is that a PROOF that I killed that person???? Since when?????

Anonymous said...

...second part of my previous comment...

- The dead body could be really well dead
- Yes, maybe it looks like a killing
- Yes, it is in my house where I am the owner and where I live and therefore everything that happens in there is at first sight my responsibility.

Based on the presence of the dead body in my house, is it PROOF that I committed the ACTION of killing????

This is not a technical question; it is not a matter of files, formats, etc. It is far simpler than that: it is a huge fallacy.

Anonymous said...

...third part of the "dead body" issue...

In fact even if I indeed killed that person, the body in my house is never in itself a proof of anything!

Even if I indeed downloaded the files "illegally", a screenshot cannot be, and is not, proof of an action of any kind, ever.

Or am I missing something here?