
Data analysis development-wishlists and datasets?

Forum rules
Please be as geeky as possible. Reference, reference, reference.
Also, please note that this is a mixed bag of math-gurus and mathematically challenged, so choose your words wisely :-)
<<

mleipold

Guru

Posts: 5796

Joined: Fri Nov 01, 2013 5:30 pm

Location: Stanford HIMC, CA, USA

Post Wed Nov 08, 2017 4:08 pm

Data analysis development-wishlists and datasets?

Hi all,

On another thread (viewtopic.php?f=3&t=823), Adeeb made a comment about having some datasets in preparation to help out with the informatics problem of doublet discrimination/exclusion. This was partly in response to El-ad's comments (Wed Oct 25, 2017 3:32 pm) about not having quite the right dataset to do some of this modeling.


It got me thinking a bit about datasets, and what various labs might have "lying around" that would be helpful in things like this. While FlowRepository and ImmPort (among others) are great, I'm sure just about every lab has some data that would be useful to *someone* doing algorithm development, benchmarking, etc., even if it's not necessarily publishable by itself.

For example, the HIMC runs a healthy control sample on almost every plate of samples we process, so we have replication data that we haven't really done a ton with, aside from using it as qualitative QC to make sure that the assay performed properly. But it could potentially be useful to a data scientist looking at day/week/month/year variation across in-house reagent lots, different instruments (one specific instrument vs. another, different instrument models, etc.), and so on.


On the flipside, I've also encountered a few instances where a data scientist was surprised by a problem or analysis need that I raised: I don't think the bench scientists are necessarily communicating *that* back to the data people.



So, 2 questions that I hope will spark a lot of discussion:

1) Data scientists, bioinformaticians, and algorithm developers (and combinations thereof): for the problems/issues you see in CyTOF data analysis, what datasets do you need (or at least want) from the bench scientists in order to better tackle them? In other words, what's your wishlist?

2) Bench researchers: what do you see as some still-unmet needs in data analysis? I'm thinking more about unsolved problems: things like A) the clean doublet discrimination mentioned above; B) detection and hopefully elimination of sample carryover; C) batch correction (both for barcoded samples *and* non-barcoded samples with a plate control). In other words, what's your wishlist?


I'm mainly thinking of datasets that have already been acquired but haven't (yet) been made publicly available, more so than from-scratch creation. But that would also be an option for the field as a whole, and probably a good chance for new research collaborations!


Mike
<<

bc2zbUVA

Contributor

Posts: 22

Joined: Thu Nov 19, 2015 4:23 pm

Post Wed Nov 08, 2017 4:11 pm

Re: Data analysis development-wishlists and datasets?

If I may add one question to bench scientists from the developer perspective, how do you want results delivered to you? Do you just want the figures and tables of statistics, do you want each cell/event annotated with the results of the analysis, or something in between?
<<

sfauteux

Participant

Posts: 17

Joined: Tue Feb 23, 2016 2:18 pm

Post Wed Nov 08, 2017 4:31 pm

Re: Data analysis development-wishlists and datasets?

One thing we encountered a lot while implementing a new panel was channel name/parameter discrepancies, which some software packages don't like.

For this reason, it would be great to have a small user-friendly tool for R-challenged people (we are numerous, we know). This file modifier tool could make it easy to change channel names/parameters and to modify file structure/format (FCS to CSV and back to FCS, for example). The idea is pretty general, but there seems to be a need for file manipulation that is not yet met (at least as far as I know) in a user-friendly, time-efficient manner.

I know these tools are easily available in R, and to some extent you can work your way around it in FlowJo, but in their current form they are both time-consuming and not very practical (for example, information is lost when exporting CSV from FlowJo, and bench people are often not willing to learn R).

It is absolutely possible that these tools exist and I just don't know of them. Hope this helps.
<<

bc2zbUVA

Contributor

Posts: 22

Joined: Thu Nov 19, 2015 4:23 pm

Post Wed Nov 08, 2017 4:36 pm

Re: Data analysis development-wishlists and datasets?

"One thing we encountered a lot while implementing a new panel was channel name/parameter discrepancies, which some software packages don't like."

Could you elaborate on this? Are you making your own csv files then trying to coerce them into FCS format?

I think I already have what you want. It sounds like you want to take an FCS file, extract the parameter table as a CSV, edit it (e.g., change "Sm149Di" to CD45), then update the parameters of the FCS with that edited CSV. Then, when you open the FCS in FlowJo/FCS Express, the channels are listed as CD45 rather than Sm149Di.
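
A minimal sketch of the "edit the parameter table, then apply it" step, in plain Python. It only handles the renaming logic; actually reading and writing the FCS file is left out and would need an FCS library. The column names `old_name`/`new_name` and the helper names are made up for illustration and are not from any existing tool.

```python
import csv
import io

def load_rename_table(csv_text):
    """Read an edited parameter table (columns: old_name,new_name) into a dict.

    Rows with an empty new_name are skipped, so untouched channels keep their name.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["old_name"]: row["new_name"] for row in reader if row["new_name"]}

def rename_channels(channels, rename_table):
    """Apply the rename table; channel order is preserved."""
    return [rename_table.get(name, name) for name in channels]

# Hypothetical parameter table after editing it in a spreadsheet
table_csv = "old_name,new_name\nSm149Di,CD45\nNd142Di,\n"
channels = ["Time", "Sm149Di", "Nd142Di"]

renamed = rename_channels(channels, load_rename_table(table_csv))
print(renamed)
```

Keeping the table as a separate CSV means the same edits can be reapplied to every file from a run.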
<<

mleipold

Guru

Posts: 5796

Joined: Fri Nov 01, 2013 5:30 pm

Location: Stanford HIMC, CA, USA

Post Wed Nov 08, 2017 4:40 pm

Re: Data analysis development-wishlists and datasets?

Hi Brian,

As a bench scientist, I'd like to have as much annotation as I can :)

For example, I've really been appreciating cytofkit and some other packages which write results like the tSNE parameters and Cluster IDs into the output daughter FCS files. That allows me to go back to FlowJo and really understand exactly what makes Phenograph Cluster #20 different from the others. It also allows me to regate that cluster on the original files (ie, the *input* parent files) to give me a sanity check on the clustering results....we've all had spurious clusters!

Fundamentally, though, I think there's probably a bit more interest from the bench researchers in just getting results (ie, "just show me what's different!"). That's one reason why so many people are interested in things like Citrus and Statistical Scaffold.


One request for developers: there seem to be relatively few algorithms for analysis of timecourse/longitudinal data (some examples: 10.4049/jimmunol.1700421 and 10.1126/sciimmunol.aan2946). For example, vaccine and challenge studies might have a pre-vacc pre-challenge timepoint, then a few post-vacc pre-challenge timepoints, then a few post-vacc post-challenge timepoints. And of course some people are going to have an immune response, while some aren't. And some are going to be protected (fully or partially), while some aren't.

Or, you may have multiple draws from a person over time, to look at circadian/daily/weekly/monthly/yearly variations in immune profile, to better understand the range of "normal".

In both those cases, due to biological variability in people, each person has to be their own control (partially or fully), often setting Pre-vacc Pre-challenge or Day 1 Draw 1 as the baseline.


But, for example, I'm not sure how to set up that analysis in Citrus, even if I have adequate numbers of responders and non-responders to make *that* differentiation.
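
To make the "each person is their own control" idea concrete, here is a small sketch in plain Python that expresses every timepoint as a change from that subject's own baseline. The record layout, the timepoint labels, and the function name are assumptions for illustration; this is not how Citrus or any other package does it.

```python
from collections import defaultdict

def change_from_baseline(records, baseline="pre"):
    """records: iterable of (subject, timepoint, value) tuples.

    Returns {subject: {timepoint: value - baseline_value}}.
    Subjects without a baseline draw are dropped, since they cannot
    serve as their own control.
    """
    by_subject = defaultdict(dict)
    for subject, timepoint, value in records:
        by_subject[subject][timepoint] = value

    deltas = {}
    for subject, series in by_subject.items():
        if baseline not in series:
            continue
        base = series[baseline]
        deltas[subject] = {
            tp: v - base for tp, v in series.items() if tp != baseline
        }
    return deltas

# Hypothetical values, e.g. percent of a population per draw
records = [
    ("S1", "pre", 10.0), ("S1", "day7", 14.0),
    ("S2", "pre", 20.0), ("S2", "day7", 19.0),
    ("S3", "day7", 5.0),  # no baseline draw
]
deltas = change_from_baseline(records)
print(deltas)
```

The per-subject deltas could then be compared between responders and non-responders; a ratio or log ratio to baseline would be an equally reasonable choice.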


Mike
<<

sfauteux

Participant

Posts: 17

Joined: Tue Feb 23, 2016 2:18 pm

Post Wed Nov 08, 2017 4:48 pm

Re: Data analysis development-wishlists and datasets?

To be specific,

At some point, we added markers to our panel, which Cytobank did not handle (I don't know why, it usually does).

The very inelegant way we worked around this was to export the FCS to CSV (in FlowJo), delete the column, and convert back to FCS to work with the file (drag and drop in FlowJo).

Some of the problems we encountered:

1: The $PnN keyword (the parameter name) was lost, so each channel ended up with a generic label such as "Parameter 1 - CD45"
2: The channels were exported in alphabetical order rather than in their original order

The operation was painful and clearly not as elegant as it could be. But it made me realize that we could use an equivalent of Microsoft Excel for FlowJo files.

This would allow easy file handling and creation of toy datasets.
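
For the CSV half of that round trip, a rough sketch in plain Python of dropping a column while keeping the remaining columns in their original order and with their original names. The function name and the example channel names are only illustrative, and converting back to FCS is not covered here.

```python
import csv
import io

def drop_columns(csv_text, unwanted):
    """Remove the named columns from CSV text, preserving column order."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    keep = [i for i, name in enumerate(header) if name not in unwanted]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([header[i] for i in keep])
    for row in reader:
        writer.writerow([row[i] for i in keep])
    return out.getvalue()

# Hypothetical exported events: one row per cell, one column per channel
events = "Time,Sm149Di,Nd142Di\n1,5.2,0.1\n2,7.9,0.0\n"
trimmed = drop_columns(events, {"Nd142Di"})
print(trimmed)
```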
<<

bc2zbUVA

Contributor

Posts: 22

Joined: Thu Nov 19, 2015 4:23 pm

Post Wed Nov 08, 2017 4:49 pm

Re: Data analysis development-wishlists and datasets?

Having looked at Citrus in depth, I can say it does support time course analysis, but the last time I worked with it (before it was available in Cytobank), it wasn't immediately apparent from the documentation how to set one up.

When I need to run a time course analysis on cytometry results, I use the methods presented in the edgeR and limma user guides, treating subpopulation counts as transcript counts. Each subpopulation as identified by your favorite clustering algorithm is analogous to a gene from your reference genome. With that framework in place, you can treat your cytometry data like transcriptome data. Cydar is essentially a guide to using edgeR on cytometry data if you haven't looked at it in depth.
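
As a sketch of the "subpopulations as genes" framing: the table below has one row per cluster and one column per sample, which is the same shape as a transcript count matrix. This is plain Python with made-up sample and cluster labels; the differential analysis itself would still be done in edgeR or limma, which is not shown here.

```python
from collections import Counter

def count_matrix(cell_labels):
    """cell_labels: iterable of (sample_id, cluster_id), one entry per cell.

    Returns (clusters, samples, matrix) where matrix[i][j] is the number of
    cells from samples[j] assigned to clusters[i]. Missing combinations are 0.
    """
    counts = Counter(cell_labels)
    samples = sorted({s for s, _ in counts})
    clusters = sorted({c for _, c in counts})
    matrix = [[counts[(s, c)] for s in samples] for c in clusters]
    return clusters, samples, matrix

# Hypothetical per-cell cluster assignments from any clustering algorithm
cells = [("A", 1), ("A", 1), ("A", 2), ("B", 2), ("B", 2), ("B", 2)]
clusters, samples, matrix = count_matrix(cells)
print(clusters, samples, matrix)
```

Rows play the role of genes and columns the role of libraries, so total cells per sample corresponds to library size.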
<<

mleipold

Guru

Posts: 5796

Joined: Fri Nov 01, 2013 5:30 pm

Location: Stanford HIMC, CA, USA

Post Wed Nov 08, 2017 4:52 pm

Re: Data analysis development-wishlists and datasets?

Hi Brian,

I think part of what Sebastien is talking about can be addressed by some of the parameter harmonizing powers of cytofCore (https://github.com/nolanlab/cytofCore#E ... _FCS_Files) and Premessa (https://github.com/ParkerICI/premessa#p ... d-renaming).


But, for example: I ran a six-center CyTOF comparison study (poster at CYTO2016; paper revisions have been submitted, so it should be published soon along with *ALL* the FCS files).

Of the 6 sites, 3-4 of them had *something* not quite consistent with the Marker+Metal+Mass naming. One site had CD66a rather than CD66 like the other 5 sites. One site had acquired Lu176 rather than Yb176; right mass, so the data's there, but every program refused to combine the files in a single analysis. Two sites acquired multiple extra channels (not a huge deal, but still more stuff to wade through), including some "duplicates" like both Nd142 (marker channel) and Ce142 (bead channel). So, a harmonizer really *MUST* treat Marker, Metal, and Mass as three *separate* issues to deal with.

To get the files ready for clustering, I had to deal with all these issues, which took a couple rounds (eg, fixing the Lu/Yb problem first, then taking *those* files into the next round of harmonizing).
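
A rough sketch of handling the Metal part on its own, in plain Python. It assumes channel names look like `Sm149Di` or `Yb176` (element symbol, mass, optional `Di` suffix); the regular expression, function names, and the per-mass preference table are all assumptions for illustration and are not taken from cytofCore or Premessa.

```python
import re

# Element symbol, 2-3 digit mass, optional "Di" suffix (assumed naming pattern)
CHANNEL_RE = re.compile(r"^([A-Z][a-z]?)(\d{2,3})(Di)?$")

def parse_channel(name):
    """Split a channel name into (metal, mass, suffix); None if it doesn't fit."""
    match = CHANNEL_RE.match(name)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), match.group(3) or ""

def harmonize_metal(name, preferred_metal_by_mass):
    """Swap the metal only for masses listed in the table; mass and suffix stay."""
    parsed = parse_channel(name)
    if parsed is None:
        return name  # e.g. "Time" is left alone
    metal, mass, suffix = parsed
    metal = preferred_metal_by_mass.get(mass, metal)
    return f"{metal}{mass}{suffix}"

site_channels = ["Time", "Lu176Di", "Nd142Di", "Ce142Di"]
fixed = [harmonize_metal(c, {176: "Yb"}) for c in site_channels]
print(fixed)
```

Only mass 176 is listed in the table, so Nd142 and Ce142 stay distinct; collapsing by mass alone would merge a marker channel with a bead channel. Marker names (CD66a vs CD66) would need their own, separate mapping.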

The Helios software will help this in the future: panels are available as template *.tem files, which you can share with people or move between computers. But that study was on v2 instruments, which didn't have similar capabilities.


So, yes, a versatile, user-friendly harmonizer would be a simple but highly useful addition. And, yes, we bench scientists like our GUIs, so that would be preferable ;)


Mike
<<

dtelad11

Master

Posts: 129

Joined: Mon Oct 31, 2016 6:26 pm

Post Wed Nov 08, 2017 6:02 pm

Re: Data analysis development-wishlists and datasets?

mleipold wrote:So, yes, a versatile, user-friendly harmonizer would be a simple but highly useful addition. And, yes, we bench scientists like our GUIs, so that would be preferable

We actually developed something similar at the HIMC. Unfortunately it had the side effect of removing the super-secret FCS section which Fluidigm puts in the CyTOF FCS files. Usually that would not be a problem, but we wanted to run the tool before Fluidigm normalization, and losing the super-secret section broke that. I might be able to build it into a nice GUI, let me check.
<<

vtosevski

Contributor

Posts: 44

Joined: Wed Nov 20, 2013 12:50 pm

Location: Zurich, Switzerland

Post Sun Nov 12, 2017 8:06 pm

Re: Data analysis development-wishlists and datasets?

mleipold wrote:For example, I've really been appreciating cytofkit and some other packages which write results like the tSNE parameters and Cluster IDs into the output daughter FCS files. That allows me to go back to FlowJo and really understand exactly what makes Phenograph Cluster #20 different from the others. As well as allows me to regate that from the original files (ie, the *input* parent files) to give me a sanity check on the clustering results....we've all had spurious clusters!

A lot of our users needed the same, so we built that bit in R, and it is pretty streamlined and efficient. I've been sending code examples around; you could also check the webinar I gave recently, where this workflow is shown in practice.
https://www.youtube.com/watch?v=je4RGJZlO98

Vinko