Modifying channels from the FCS files

PostPosted: Thu Sep 26, 2019 9:41 pm
by avinash1
Hi CytOFers! I removed 5 extra unused, conflicting channels from FCS files that were run across multiple time points and re-exported them from Premessa. Interestingly, the file size has dropped by half, but the event counts remain the same. Just curious: is this expected? Is it normal? Why has the file size halved?

Cheers
Avi

Re: Modifying channels from the FCS files

PostPosted: Thu Sep 26, 2019 10:01 pm
by dtelad11
FCS files that come from the CyTOF have an additional section that is used by the Fluidigm software for normalization. Personally, I fondly refer to it as the "super-secret-Fluidigm-section". If you manipulate the file outside of the Fluidigm software (via FlowJo, Cytobank, R, etc.) that section will be discarded, cutting the file size by half.

Re: Modifying channels from the FCS files

PostPosted: Thu Sep 26, 2019 10:05 pm
by avinash1
Thanks El-ad. Just wanted to confirm that this will not affect the data in any way?

Avi

Re: Modifying channels from the FCS files

PostPosted: Thu Sep 26, 2019 10:28 pm
by dtelad11
Make sure to normalize with the Fluidigm software ahead of any downstream analysis. Assuming you've done that, you're safe.

Re: Modifying channels from the FCS files

PostPosted: Thu Sep 26, 2019 10:37 pm
by avinash1
Yes, I did that originally when generating the data from the machine.

Avi

Re: Modifying channels from the FCS files

PostPosted: Sun Sep 29, 2019 8:23 pm
by vtosevski
Hi Avi and El-ad,

I don't think it's the additional section. I looked for that mythical section a long time ago, as I too was told it was there (the randomised and non-randomised matrix). To me it made sense that it should be there, but I was never able to see it myself. El-ad, have you seen it? :)

Mike Jiang from the RG lab at Fred Hutch told me how to find a second matrix (if it's there), and it wasn't. Instead, he suggested the size difference most likely has to do with the number of bits per data point (see the old thread here: https://support.bioconductor.org/p/109258/).

I just realized I never closed that thread but if memory serves me well, I followed their advice and could confirm it to be the case.

Vinko

Re: Modifying channels from the FCS files

PostPosted: Sun Sep 29, 2019 10:46 pm
by dtelad11
Vinko, I'm confused by your reply -- there is nothing "mythical" about this section. Open the FCS file with a hex editor and scroll to the end of the file, and you will see XML tags. Additionally, check out the header: there is a definition of a "user-defined OTHER segment" following the other segments such as TEXT and DATA, as per the FCS standard. The existence and utility of that segment were confirmed by Fluidigm reps and by FlowJo personnel.
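For anyone who prefers a script to a hex editor, here is a minimal Python sketch of the header check. It assumes the standard FCS 3.x HEADER layout: version string in bytes 0-5, then 8-character right-justified ASCII offsets for TEXT, DATA and ANALYSIS starting at byte 10, with any OTHER segment offset pairs following from byte 58 up to the start of TEXT. The function name `header_segments` is illustrative, not part of any library, and whether a given CyTOF file actually declares OTHER offsets there is something to verify on your own files.

```python
def header_segments(fcs_bytes: bytes) -> dict:
    """Read the segment byte offsets declared in an FCS 3.x HEADER."""

    def offset(pos: int) -> int:
        # Each offset is an 8-character, space-padded ASCII integer.
        field = fcs_bytes[pos:pos + 8].decode("ascii").strip()
        return int(field) if field else 0

    segments = {
        "version": fcs_bytes[0:6].decode("ascii"),
        "TEXT": (offset(10), offset(18)),
        # DATA offsets may be 0 here for large files; the real values
        # are then in the $BEGINDATA / $ENDDATA keywords.
        "DATA": (offset(26), offset(34)),
        "ANALYSIS": (offset(42), offset(50)),
        "OTHER": [],
    }

    # Assumption: OTHER offset pairs fill the gap between byte 58 and TEXT.
    text_start = segments["TEXT"][0]
    pos = 58
    while pos + 16 <= text_start:
        first = fcs_bytes[pos:pos + 8].strip()
        last = fcs_bytes[pos + 8:pos + 16].strip()
        if not (first.isdigit() and last.isdigit()):
            break  # blank padding or non-numeric bytes: stop looking
        segments["OTHER"].append((int(first), int(last)))
        pos += 16
    return segments
```

Only the start of the file is needed, so something like `header_segments(open(path, "rb").read(256))` is enough, where `path` is your own file.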

Re: Modifying channels from the FCS files

PostPosted: Sun Sep 29, 2019 11:02 pm
by ChrisCiccolella
My vote is also with the additional section. The FCS spec does allow for 64 bit encoding (typical is 32 bit) but I don't think I've ever seen it implemented despite intentionally looking for examples. This is with good reason since it offers an unnecessary degree of numeric precision while doubling the amount of storage needed to encode the data. Why do that? On the other hand, I have indeed seen extra stuff pasted to the end of Fluidigm FCS files.

Anyway, there is no sense in speculating because it's easy to determine the answer:

Start by opening your FCS file with a text editor or hex editor.

To know the bits per value, look at the $PnB keyword for each channel within the TEXT segment at the beginning of the file. E.g., $P4B gives the bits per value of the fourth channel. This will normally be 32, but if it's 64, then that would explain the doubling in size. The loss of half the data size would then come from reading in the data as 64 bit and writing it out again as 32 bit, which most software is likely hard-coded to do.

You can also look at the $DATATYPE keyword. If this has a value of F (for "float") then you should have 32 bit encoding. A value of D (for "double") means 64 bit encoding. Besides F I think I have only ever seen I (for "integer"). If I recall correctly, the YETI used integer encoding in its very early days. I doubt it does this still?
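The two keyword checks above can also be scripted. This is a minimal Python sketch, assuming a standard FCS 3.x layout (TEXT offsets in bytes 10-25 of the HEADER, first character of TEXT is the delimiter). It ignores escaped (doubled) delimiters inside values, and the names `read_text_keywords` and `encoding_summary` are illustrative only.

```python
def read_text_keywords(fcs_bytes: bytes) -> dict:
    """Parse the primary TEXT segment into a {KEYWORD: value} dict."""
    text_start = int(fcs_bytes[10:18].decode("ascii").strip())
    text_end = int(fcs_bytes[18:26].decode("ascii").strip())
    # Offsets are inclusive, hence the +1 on the end of the slice.
    text = fcs_bytes[text_start:text_end + 1].decode("latin-1")
    delimiter = text[0]
    fields = text[1:].split(delimiter)
    # Fields alternate keyword, value, keyword, value, ...
    return {k.strip().upper(): v for k, v in zip(fields[0::2], fields[1::2])}


def encoding_summary(fcs_bytes: bytes) -> dict:
    """Report $DATATYPE and the $PnB value of every channel."""
    keywords = read_text_keywords(fcs_bytes)
    n_channels = int(keywords["$PAR"])
    return {
        "datatype": keywords["$DATATYPE"].strip(),  # F = float, D = double, I = integer
        "bits_per_channel": [
            int(keywords[f"$P{i}B"]) for i in range(1, n_channels + 1)
        ],
    }
```

If `datatype` is `F` and every entry of `bits_per_channel` is 32, a 64-to-32-bit conversion cannot be the reason for the halved file size.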

So now about the extra section and how to parse it:

The simplest way to infer its existence is to look at the $BEGINDATA and $ENDDATA keywords in the TEXT segment. Subtract the former from the latter and you have the number of bytes of data that are encoded in the file. Since data should be the primary share of the file size, simply comparing this value to the size of the file on disk will give a simple indication of whether or not there is extra stuff in the file. For example, a Fluidigm file I have on my computer has a disk size of 257 MB. When I calculate the theoretical data size I get (84803681-3682) bytes = 84.8 MB. If I read the file into R and write it back out again, sure enough, the written FCS file is 84.8 MB.
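The arithmetic above can be sketched in Python. The function name `size_breakdown` is illustrative, and the file size of 257,000,000 bytes is an assumption (the post only says 257 MB); the two offsets are the ones quoted above. Since FCS byte offsets are inclusive, the DATA segment length is `$ENDDATA - $BEGINDATA + 1`, one byte more than the plain difference.

```python
def size_breakdown(file_size: int, begin_data: int, end_data: int) -> dict:
    """Compare the declared DATA segment size with the size on disk."""
    data_bytes = end_data - begin_data + 1        # inclusive offsets
    bytes_after_data = file_size - (end_data + 1)  # everything past $ENDDATA
    return {
        "data_bytes": data_bytes,
        "bytes_after_data": bytes_after_data,
        "data_fraction": data_bytes / file_size,
    }


# Offsets from the example file above; the file size is an assumed round number.
report = size_breakdown(file_size=257_000_000, begin_data=3682, end_data=84_803_681)
print(f"DATA segment:     {report['data_bytes'] / 1e6:.1f} MB")
print(f"Bytes after DATA: {report['bytes_after_data'] / 1e6:.1f} MB")
print(f"DATA share:       {report['data_fraction']:.0%}")
```

A large `bytes_after_data` value means something other than the event matrix is taking up space after the declared end of the data.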

The FCS spec also allows for other information to be officially encoded in the FCS file. The byte locations of this information should be given by other keywords such as $BEGINANALYSIS, $ENDANALYSIS, and $NEXTDATA. $NEXTDATA, as your linked thread points out, would be used for another data matrix but this is not used in the example file I'm looking at. So, there is no immediate official explanation for the rest of the information encoded past the stated end of the data in this file.

Another observation to make is that sometimes there is a bunch of XML at the very end of the file. I think this is known as the "XML tail". So, there is clearly precedent for Fluidigm appending things to the end of FCS files. This little bit of XML doesn't explain the large increase in file size, though.
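A quick way to look for such a tail without a hex editor is a small Python sketch like the one below. It simply searches the last few kilobytes for something that looks like an XML declaration or opening tag; this is a heuristic, so binary data containing a `<` followed by letters could give a false positive. The function name `xml_tail_preview` and the 4096-byte window are arbitrary choices, not anything defined by Fluidigm or the FCS standard.

```python
import re
from typing import Optional


def xml_tail_preview(fcs_bytes: bytes, window: int = 4096) -> Optional[str]:
    """Return the trailing text from the first XML-looking tag, or None."""
    # latin-1 maps every byte to a character, so decoding never fails.
    tail = fcs_bytes[-window:].decode("latin-1")
    # Heuristic: an XML declaration, or '<' + tag name + whitespace or '>'.
    match = re.search(r"<\?xml|<[A-Za-z][\w:.-]*[\s>]", tail)
    return tail[match.start():] if match else None
```

For a file on disk, reading only the end is enough: open it in binary mode, `seek(-4096, 2)`, and pass the bytes you read to the function.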

So how do we parse it? The first step is to know ahead of time exactly what the information is and how it's encoded. Then you would have to write a routine that reads in the correct bytes and processes them to the correct data structure. Not sure if this information is public or could be offered by a Fluidigm rep or anyone who knows. Obviously it's liable to change at some point.

@El-ad, I don't see any other byte pointer keywords in the FCS file I'm looking at right now, but that would help pull out the correct bytes for this information.

Re: Modifying channels from the FCS files

PostPosted: Mon Sep 30, 2019 8:27 pm
by vtosevski
Hi both,

@El-ad, I didn't mean anything bad by "mythical". I don't look at FCS files with hex editors and have never seen the XML tags and user-defined OTHER segment you mention. It felt intuitive that an additional matrix should be in there (as the CyTOF software can go back and forth between the randomized and non-randomized matrix), but I never managed to see it myself, which is why I used the word "mythical". I am not a native speaker of English, so if that word has a "heavier" meaning than the one I intended, my bad.

@Chris - thanks for this exhaustive reply. I just took some files as they come out of the instrument and checked: they are indeed encoded with 32-bit precision (and $DATATYPE is "F"). The $NEXTDATA is 0, so the additional matrix it is, in Fluidigm's own way, I suppose!

Vinko