### 2018-Cosma-Cytometry A

Antonio Cosma

Cytometry A, 2018

DOI: 10.1002/cyto.a.23346

11 posts
• Page **1** of **2** • **1**, 2

"Universal cell type identifier based on number theory"


Hi all,

We can move this discussion over to Data Analysis if we want, but I think it might be useful to keep the initial discussion tied to the paper.

I think this is an interesting idea, giving a cell type a "number" based on the prime numbers assigned to the CD markers it expresses. This could allow faster searches for similarity with cells found in other papers, without completely limiting it to only exactly the same markers.
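To make the idea concrete, here is a minimal sketch in Python. The marker-to-prime assignments below are hypothetical placeholders (the paper's actual Supporting Information table is not reproduced here); the point is only that, by unique factorization, "does cell B express at least the markers of cell A?" reduces to a single divisibility test.

```python
# Minimal sketch of the prime-product scheme. The marker-to-prime
# assignments are made up for illustration, not taken from the paper.
MARKER_PRIMES = {"CD3": 19, "CD4": 23, "CD8": 29, "CD45": 151, "CD127": 983}

def unn(markers):
    """Multiply together the primes of all markers a cell expresses."""
    code = 1
    for m in markers:
        code *= MARKER_PRIMES[m]
    return code

def shares_phenotype(query, cell_code):
    """The query's markers are a subset of the cell's markers exactly
    when the query's product divides the cell's product."""
    return cell_code % unn(query) == 0
```

So a search for, say, CD3+CD4+ cells becomes one modulo operation per stored record, regardless of which other markers each record happens to include.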

The paper does expand on ways to accommodate "what if the marker I'm using doesn't have a CD number?" and "what if I have mid/low rather than pos/neg?"

I guess my only real comment is that it potentially has lower "resolution" about signal intensity than Jonathan Irish's Marker Enrichment Modeling (DOI: 10.1002/cpcy.34 ; doi:10.1038/nmeth.4149).

Thoughts? Especially from computational people who would be integrating such searches into workflows....

Mike


Very elegant approach, thanks for posting this. I don't have a huge issue with the loss in resolution, as the potential gain in the number of markers being stored at once seems worthwhile. I haven't run into a search issue yet that hasn't been solvable by more elegant coding, but I will keep this approach in mind should it ever happen. I'd love to see it applied to scRNA-seq, where we run into these sorts of issues a lot more frequently. Currently, I'm encoding immunophenotypes using the phenocodes approach from the flowType R package, but that varies from experiment to experiment. The thing I appreciate about this approach is that every ID that would be annotated could be derived regardless of the panel. My biggest concern would be how this handles extremely large phenotypes, though apparently we can get up to 20,000 markers easily enough:

> Of note, the 20,244th prime is a six-digit number, still falling in the range of the elliptic curve factorization method and the software package described above.

and:

> To demonstrate the capacity to handle big integers, I multiplied all the primes associated to the 401 CD markers shown in Supporting Information Table 1 to obtain a UNN of 1,177 digits.

However, storing those digits is going to become a huge drain on memory at some point. Storing thousands of digits per cell is going to be extremely memory hungry, whereas using the phenocode method, a 401-parameter signature only requires 401 digits. Again, the largest issue is search time, and I would be interested to see search comparisons of a phenocode vs. the UNN approach laid out here.
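The memory tradeoff can be ballparked in Python. As a rough sketch, assuming (hypothetically) that the 401 markers were simply assigned the first 401 primes, the product's digit count comes out in the same range as the paper's reported 1,177 digits, while a plain pos/neg bit vector over the same panel fits in about 51 bytes.

```python
import math

# Rough size estimate, hypothetically assigning the 401 markers the
# first 401 primes (the paper's actual assignments differ).
def first_primes(n):
    primes, candidate = [], 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

# Digit count of the worst-case UNN (all 401 markers positive):
digits = len(str(math.prod(first_primes(401))))

# A plain presence/absence phenocode needs only one bit per marker:
bit_vector_bytes = math.ceil(401 / 8)
```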

I would also like to mention some practical benchwork/experimental issues: anything that causes an experimental artifact could potentially affect this coding. For example, freeze-thaw, fixation, methanol permeabilization, certain types of stimulation, etc., are known to affect expression levels of certain markers.

In my experience, though, most of these usually wind up as negatives (e.g., a False Negative where the antibody doesn't bind because you messed up its epitope, or a True Negative, like CD4 downregulation upon certain strong stims). Therefore, if I'm understanding Cosma's plan, you just wouldn't multiply in the associated prime, and it would give you the same product as not including the marker in the first place.

However, there are definitely cases of False Positives, which would potentially be more problematic. One example would be the increased CD14/CD16 binding to negatives from a pre-mixed cocktail that my coauthors and I saw in the Multicenter paper (https://doi.org/10.1016/j.jim.2017.11.008).
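The asymmetry is easy to illustrate with toy numbers (the primes below are made up for the example, not taken from the paper's table). A dropped marker just leaves its prime out of the product, but a spurious marker multiplies in a prime the cell never earned:

```python
# Toy illustration of false negatives vs. false positives under a
# prime-product code, with made-up primes.
true_code = 19 * 23              # the cell's real phenotype, e.g. CD3+CD4+
false_negative = 19              # one marker lost to an artifact
false_positive = 19 * 23 * 29    # a spurious marker multiplied in

assert true_code % false_negative == 0   # still consistent with the real cell
assert false_positive % true_code == 0   # broad queries still match, but...
assert false_positive % 29 == 0          # ...the record now answers searches
                                         # for a marker it never expressed
```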

I'm not saying there aren't ways to deal with this, but better to keep it in mind from the beginning.....

Mike


Quoting the paper:

> Finally, with a similar principle a more detailed definition could be achieved for population defined as expressing a marker at a "medium"/"low" level by simply cubing the prime

This statement is wrong. You cannot proceed in the same way to get a more detailed definition of cell phenotypes.

Here is an issue based on the author example, if CD127 (983) had 3 levels of definition:

19*23*151*277*983*983*983 = 17361958221158712

but:

17361958221158712 = 2*2*2*3*7*977*3023*34991029

This demonstrates that the PPS system cannot handle one of the most complicated problems in defining cell populations: cell phenotypes are not defined in terms of the absence or presence of cell markers, but rather in terms of gradients of cell marker expression. It is quite scary that the editor and the reviewers did not notice this issue…

Hi,

I discussed with Antonio about handling levels. If I remember correctly, this is possible using powers of the prime numbers.

I will ask him to register this forum and to answer your questions.

Nevertheless, I think that levels should not be too numerous, otherwise assignation will become very subjective.
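A quick sketch of how powers would work, assuming CD127 → 983 as in the example above and encoding low/medium/high as exponents 1/2/3 of the marker's prime. Because prime factorization is unique, the exponent survives multiplication and can always be recovered:

```python
# Encode expression levels as prime exponents; marker-to-prime
# assignments are supplied by the caller (hypothetical here).
def encode(levels, primes):
    """levels: {marker: level}, e.g. {"CD127": 3} for high expression."""
    code = 1
    for marker, level in levels.items():
        code *= primes[marker] ** level
    return code

def level_of(code, p):
    """Recover a marker's level as the multiplicity of its prime."""
    k = 0
    while code % p == 0:
        code //= p
        k += 1
    return k
```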

Cheers.


Dear Cytoboy,

Some simple math.

If you multiply odd numbers, you get an odd number as a result, so your first calculation is simply wrong from the beginning.

I advise you to calculate correctly and you will see the PPS works.

Always calculate twice before telling something is wrong!

Regards,

Antonio Cosma

PS: I would like to add that, apart from the calculation error, you are challenging the fundamental theorem of arithmetic, and it would be really scary if, in more than 2,000 years, nobody had noticed this problem.


bc2zbUVA wrote:

> Storing thousands of digits per cell is going to be extremely memory hungry.

Yes, I guess, but wouldn't this be solved by using a dictionary? It's not like every cell has a different cell type...

Sturla

Hi all,

First of all: let's please keep comments friendly.

Second: I think there might be a computational problem that's raising its head here. A lot of computer programs limit the number of significant figures involved in calculations.

For example, Excel truncates somewhere around 15 significant figures:

http://precisioncalc.com/what_is_xlprecision.html

https://stackoverflow.com/questions/344 ... flow-error

You can see this in my attached Excel example:

Everything is going OK until the final multiplication by 983: then you get a case where a number ending in "1" is multiplied by a number ending in "3" and the result ends in "0"... which I think we can all agree is incorrect. Doing it on paper by hand, I come up with a number ending "...,158,713" rather than "...,158,700".

Assuming this is what's causing the issue today: this would be something that would have to be taken into CAREFUL account when writing programs to do these computations... numerical truncation errors would make this method completely useless.
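For what it's worth, the disputed product can be checked in Python, whose integers have arbitrary precision. Forcing the final multiplication through 64-bit floating point (roughly what a spreadsheet does beyond ~15 significant digits) reproduces the even value quoted earlier in the thread:

```python
# Exact arbitrary-precision product of the thread's example:
exact = 19 * 23 * 151 * 277 * 983 ** 3      # 17361958221158713 -- odd,
                                            # as a product of odd primes must be

# The same product with the last step done as a 64-bit float multiply;
# the true result falls in a range where doubles are spaced 2 apart,
# so the low digit is rounded away:
rounded = int(float(19 * 23 * 151 * 277 * 983 ** 2) * 983)
# rounded == 17361958221158712, the even value quoted earlier
```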

Mike


laustr wrote:

> bc2zbUVA wrote:
>
> > Storing thousands of digits per cell is going to be extremely memory hungry.
>
> Yes, I guess, but wouldn't this be solved by using a dictionary? It's not like every cell has a different cell type...
>
> Sturla

Well, if using a dictionary, the question becomes which dictionary is more efficient: one where every immunophenotype is encoded by a vector of equal length, or one where every immunophenotype is encoded by a vector of variable length? I've been to a few lectures where they optimized compression of genomic variant lookups by sorting the variants based on frequency, and I could see an argument for doing a similar approach with the primes. The more frequently a marker is used in phenotyping, the lower its prime value should be. However, this is beyond my expertise in data structures. I'll have to grab the table from the paper and use it to create dictionaries for some of my older analyses.
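The frequency-sorted assignment could be sketched as below: markers that appear in the most panels get the smallest primes, so the most common immunophenotypes get the shortest codes. The panel data here is made up for illustration, and this ordering is a departure from the paper's fixed CD-number assignment.

```python
from collections import Counter

# Assign the smallest primes to the most frequently used markers,
# so common phenotypes produce the shortest products.
def first_primes(n):
    primes, candidate = [], 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def assign_primes(panels):
    counts = Counter(marker for panel in panels for marker in panel)
    ranked = [marker for marker, _ in counts.most_common()]
    return dict(zip(ranked, first_primes(len(ranked))))

# Made-up example panels:
panels = [{"CD3", "CD4", "CD8"}, {"CD3", "CD4", "CD127"}, {"CD3", "CD19"}]
primes = assign_primes(panels)
# CD3 appears in every panel, so it receives the smallest prime, 2.
```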
